uWSGI workers dying in 1.40.0 #2699

Closed
fiendish opened this issue Feb 1, 2024 · 35 comments · Fixed by #2738

@fiendish

fiendish commented Feb 1, 2024

How do you use Sentry?

Sentry Saas (sentry.io)

Version

1.40.0

Steps to Reproduce

I don't have a reproduction yet due to the nature of the issue, for which I apologize, but with 1.40.0 my server processes are silently dying after about 90 minutes of run time. This does not happen with 1.39.2. I just want to get it on your radar that the latest update may be causing issues for some. I wish I had more information, but I don't at this time.

Expected Result

^

Actual Result

^

@seb-b

seb-b commented Feb 2, 2024

I also had a similar issue with 1.40: it was killing my uWSGI workers, and I had to downgrade to 1.39. I also wish I had more information to give, but there wasn't much to go on. The issue was worse with profiling on (workers would be killed via SIGBUS).

@sentrivana
Contributor

sentrivana commented Feb 2, 2024

@fiendish @seb-b Hey folks, thanks for bringing this to our attention. Even if there is no repro, more info could help point us in the right direction -- what kind of apps (Django, Flask, FastAPI, something else?) are you running? Which SDK integrations are active (initing the SDK with debug=True should log this at startup)?

Looking at the changelog for 1.40.0, there are a couple of things we could try (a combined example follows the list):

  • turning off metrics collection:
    sentry_sdk.init(
        ... # your usual stuff
        _experiments={
            "enable_metrics": False,
        }
    ) 
  • turning off DB query source capture:
    sentry_sdk.init(
        ... # your usual stuff
        enable_db_query_source=False,
    ) 
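
For reference, a combined sentry_sdk.init with both of those turned off might look roughly like this (a sketch; keep your existing DSN, integrations, and other settings as they are):

import sentry_sdk

sentry_sdk.init(
    dsn="...",  # your usual settings go here
    enable_db_query_source=False,
    _experiments={
        "enable_metrics": False,
    },
)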

@fiendish
Author

fiendish commented Feb 2, 2024

Oops, yeah, sorry. In my haste I forgot to include that. In my case it's Flask 2.1.3 + uWSGI 2.0.23, and

import logging

from sentry_sdk.integrations.flask import FlaskIntegration
from sentry_sdk.integrations.logging import LoggingIntegration

integrations = [
    FlaskIntegration(),
    LoggingIntegration(
        level=logging.INFO,
        event_level=logging.ERROR,
    ),
]

I'll try to find time over the weekend or on Monday to try those things.

@gerricom

gerricom commented Feb 2, 2024

Same here with Django 3.2 LTS, uWSGI 2.0.23. These are my settings:

import logging

from sentry_sdk.integrations.django import DjangoIntegration
from sentry_sdk.integrations.logging import LoggingIntegration

sentry_logging = LoggingIntegration(
    level=logging.INFO,  # Capture info and above as breadcrumbs
    event_level=logging.ERROR,  # Send errors as events
)
sentry_settings = {
    "dsn": SENTRY_LIVE_DSN,
    "integrations": [DjangoIntegration(), sentry_logging],
    "send_default_pii": True,
    "enable_tracing": True,
}
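
(A minimal sketch of how such a settings dict is typically passed to the SDK; the actual call site isn't shown in the comment:)

import sentry_sdk

sentry_sdk.init(**sentry_settings)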

@sentrivana
Contributor

Thanks @gerricom -- are you able to try out turning metrics and/or DB query source off as outlined here?

@gerricom

gerricom commented Feb 2, 2024

Yes, I'd like to try this. But I'll have to wait until things are a bit calmer here in the evening (in ~6 hrs or so).

@arvinext

arvinext commented Feb 2, 2024

We hit the same issue too, with a Python 3.11 / Django 4.1.10 app running on uWSGI 2.0.23 and sentry-sdk 1.40.0. We upgraded from sentry-sdk 1.15.0.

We did not change the sentry_sdk.init, which is:

sentry_sdk.init(  # type: ignore[abstract]
    dsn=SENTRY_DSN,
    integrations=[
        sentry_django(),
        sentry_logging(event_level=None),
    ],
    release='myapp',
    environment=settings.APP_ENVIRONMENT,
    attach_stacktrace=True,
    sample_rate=DEFAULT_SAMPLE_RATE,
    shutdown_timeout=DEFAULT_SHUTDOWN_TIMEOUT,
    before_send=alert_utils.before_send,
)

Our uWSGI setup:
In our case, we have the following setting to recycle uWSGI workers every hour:

max-worker-lifetime= 3600 #seconds

After an hour of worker lifetime, we see that the uWSGI master complains that the listen queue is full. Log snippet attached below. The workers never get respawned and we eventually lose the Kubernetes pod.

Fri Feb  2 17:53:20 2024 - worker 1 lifetime reached, it was running for 21 second(s)
Fri Feb  2 17:53:20 2024 - worker 2 lifetime reached, it was running for 21 second(s)
[WARNING] 032/175340 (10426) : Server uwsgi/uwsgi-7001 is DOWN, reason: Layer7 timeout, check duration: 6000ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 032/175342 (10426) : Server uwsgi/uwsgi-7002 is DOWN, reason: Layer7 timeout, check duration: 6000ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 032/175342 (10426) : Server uwsgi/uwsgi-7000 is DOWN, reason: Layer7 timeout, check duration: 6001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 032/175342 (10426) : backend 'uwsgi' has no server available!
Fri Feb  2 17:59:35 2024 - *** uWSGI listen queue of socket "127.0.0.1:7101" (fd: 3) full !!! (32/32) ***
Fri Feb  2 17:59:36 2024 - *** uWSGI listen queue of socket "127.0.0.1:7101" (fd: 3) full !!! (32/32) ***

I was able to reproduce this in our test environment by setting max-worker-lifetime= 20 #seconds and hence making the workers die sooner. I could see the same behavior as in the logs above.

Additionally, I noticed that the worker processes never went away when I ran ps aux, which makes me think that supervisord, which we use to restart the processes, thought the processes never died.

I also confirmed that this issue does not occur when we kill workers via _raise_sigint() based on PSS memory, or when workers get reloaded using the reload-on-rss parameter.

I have only noticed this happening when the max-worker-lifetime value kicks in.
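
(For anyone trying to reproduce this, a minimal command line of the kind described above might look roughly like the following; the module path and socket are placeholders, not the actual app:)

uwsgi --module myapp.wsgi:application --http-socket 127.0.0.1:7101 --master --processes 4 --enable-threads --max-worker-lifetime 20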

@xrmx
Contributor

xrmx commented Feb 12, 2024

@PAStheLoD thanks for spending time on this. While you are at it, does lazy-apps suffice or do you really need lazy?

@sentrivana
Contributor

Thanks @PAStheLoD, this is very helpful. I'd also be interested in whether --lazy-apps has the same effect on your app or if you need --lazy.

My current thoughts on this are that there has to be a way to make this work since the transport worker and profiler work largely the same way and we're not seeing them explode people's apps. One difference (other than the use of threading.Event) that I see in the metrics code vs. other places where we have background threads is that in metrics we start the thread right off the bat, whereas elsewhere the thread is only started on demand. I can imagine this making a difference in prefork mode. I'll align this.

I'm also wondering if upgrading to the recent uWSGI 2.0.24 might make a difference?
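
To illustrate the difference described above (a rough sketch of the pattern, not the SDK's actual code): starting a background flusher thread at init time means the thread is created in the uWSGI master and is lost when the workers are forked, whereas starting it lazily on first use means each worker spawns its own thread after the fork.

import threading


class LazyFlusher:
    # Sketch of an on-demand background thread; the flush/buffer bodies are placeholders.

    def __init__(self):
        self._thread = None
        self._lock = threading.Lock()

    def _ensure_thread_running(self):
        # Start the thread only when first needed, i.e. inside the (forked) worker,
        # instead of at import/init time in the uWSGI master process.
        with self._lock:
            if self._thread is None or not self._thread.is_alive():
                self._thread = threading.Thread(target=self._flush_loop, daemon=True)
                self._thread.start()

    def _flush_loop(self):
        ...  # periodically flush buffered data

    def record(self, key, value):
        self._ensure_thread_running()
        ...  # buffer (key, value) for the next flush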

@Tenzer

Tenzer commented Feb 13, 2024

We were also hit by uWSGI workers getting stuck, presumably due to changes in sentry-sdk, although we did try to revert the upgrades without much luck. In the end we migrated from uWSGI to Unit and it's been rock solid for us. Just in case anybody else also gets frustrated with uWSGI...

@sentrivana
Contributor

sentrivana commented Feb 13, 2024

Everyone, we've just released 1.40.4 with a tentative fix.

If you're affected by this issue, it'd greatly help us out if you could try it out and report back. What you need to do is install sentry-sdk 1.40.4, add the following to your init, and run with your usual uWSGI config.

sentry_sdk.init(
    ... # your usual stuff
    # Metrics collection is still off by default in 1.40.4 if you're under uWSGI,
    # but this can be overridden by
    _experiments={
        "force_enable_metrics": True,
    }
) 

Caveat: The fix in 1.40.4 should work as long as you're not manually emitting any metrics (with sentry_sdk.metrics.<metric_type>(...)) on startup (on demand, e.g. on request, is fine). In that case the only known fix so far is to run uWSGI with --lazy-apps to disable preforking mode. We're looking into addressing this as well without requiring --lazy-apps.

Also, please note that for the SDK not to break in other unexpected ways, uWSGI needs to be run with --enable-threads.
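
To illustrate the caveat above with a rough Flask-flavoured sketch (the app and metric names are made up; assume sentry_sdk.init(...) has been called as usual): the first call runs at import time in the master before forking and forces the metrics thread to start early, while the second runs on demand inside a worker and is fine.

from flask import Flask
from sentry_sdk import metrics

# sentry_sdk.init(...) assumed to have been called already (not shown).
app = Flask(__name__)

# Risky under preforking uWSGI: emitted at import time, in the master process,
# before the workers are forked.
metrics.incr("app.module_imported")


@app.route("/")
def index():
    # Fine: emitted on demand inside a forked worker.
    metrics.incr("requests.index")
    return "ok"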

@natano

natano commented Feb 14, 2024

I can reproduce the issue.

uwsgi invocation with a.py from @PAStheLoD:

uwsgi -w a:flask_app --need-app --http-socket 127.0.0.1:8080 -p 4 --enable-threads

The segfault is caused by uwsgi not handling threads correctly with default arguments.

When forking Python with multiple threads running, the fork hooks PyOS_BeforeFork, PyOS_AfterFork_Child and PyOS_AfterFork_Parent have to be run before/after the fork. A correct fork sequence looks something like this. uwsgi runs none of those by default. It seems like this is by design, as uwsgi assumes that WSGI apps are single-threaded (source).

To make the segfault disappear, uwsgi has to be called with --enable-threads (to initialize the GIL) and --py-call-uwsgi-fork-hooks (to run PyOS_AfterFork_Child in the worker after the fork).

uwsgi -w a:flask_app --need-app --http-socket 127.0.0.1:8080 -p 4 --enable-threads --py-call-uwsgi-fork-hooks

Interestingly, this behaviour is still not fully correct: only PyOS_AfterFork_Child is called; PyOS_BeforeFork and PyOS_AfterFork_Parent are not. I'm not sure if that can lead to segfaults too, but in any case not all hooks registered by os.register_at_fork are called when the worker is forked (the before and after_in_parent hooks are not called).
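
(For anyone who wants to check this themselves, a small diagnostic along these lines can be dropped into the WSGI module; the logging destination is arbitrary. The registered callbacks are the ones driven by PyOS_BeforeFork, PyOS_AfterFork_Parent and PyOS_AfterFork_Child respectively:)

import os
import sys


def _log_fork_hook(stage):
    # Record which fork hook fired and in which process.
    print(f"fork hook {stage!r} fired in pid {os.getpid()}", file=sys.stderr)


os.register_at_fork(
    before=lambda: _log_fork_hook("before"),
    after_in_parent=lambda: _log_fork_hook("after_in_parent"),
    after_in_child=lambda: _log_fork_hook("after_in_child"),
)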

Not sure if there is any viable workaround for the sentry SDK, as the corruption happens outside of its control.

IMHO this is not a sentry issue, but caused by unsafe defaults in uwsgi. I wish uwsgi would pick safer defaults.


Used software versions:
Linux 6.1.71-1-lts (arch linux kernel build)
Python 3.11.8 (arch linux PKGBUILD, but rebuilt with debug symbols)
uwsgi 2.0.23 and 2.0.24
sentry_sdk 1.40.2

@sentrivana
Contributor

sentrivana commented Feb 14, 2024

Thanks a lot for looking into this @natano!

I think the best we can do on the SDK side moving forward is to issue a warning on startup if we detect we're in non-lazy mode and --py-call-uwsgi-fork-hooks is not on, just like we do currently if you're running uWSGI without --enable-threads.

The fix in 1.40.4 should work even without --py-call-uwsgi-fork-hooks in most cases because with that release we spawn all threads on demand BUT if you're unlucky and a thread is spawned at startup before forking, you might run into this again.

So to sum up for future reference, if you run into this issue:

  • run uWSGI with both --enable-threads and --py-call-uwsgi-fork-hooks
  • OR run uWSGI with --lazy-apps (or --lazy, but this is discouraged) and --enable-threads

@thedanfields

To add some extra seasoning to this issue.

We saw failures in our AWS Lambda executors after moving to 1.40.3:

Traceback (most recent call last):
  < ... omitted ... >
  File "/var/task/sentry_sdk/integrations/threading.py", line 56, in sentry_start
    return old_start(self, *a, **kw)
  File "/var/lang/lib/python3.9/threading.py", line 899, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

@sentrivana
Contributor

@thedanfields I'd wager a guess that's an AWS-specific issue - reminds me of #2632 without the upgrading-to-Python-3.12 part. I wonder if we've hit some kind of AWS Lambda limit on the number of threads, since the SDK is now spawning one additional thread.

Could you open a new issue with this please?

@thedanfields

@sentrivana , happy to oblige

I created a new issue as requested.

untitaker added a commit to getsentry/snuba that referenced this issue Feb 16, 2024
it seems getsentry/sentry-python#2699 is
closed, and the outcome is:

* deadlock is fixed
* segfault is caused by improper uwsgi settings, and the SDK should warn
  if those are present

I ran `snuba api` locally and the warning doesn't trigger, so I assume
we're good. But also, I think we may have only encountered the deadlock,
not the segfault.