uWSGI workers dying in 1.40.0 #2699

Closed
fiendish opened this issue Feb 1, 2024 · 35 comments · Fixed by #2738

@fiendish

fiendish commented Feb 1, 2024

How do you use Sentry?

Sentry Saas (sentry.io)

Version

1.40.0

Steps to Reproduce

I don't have a reproduction yet due to the nature of the issue, for which I apologize, but with 1.40.0 my server processes are silently dying after about 90 minutes of run time. This does not happen with 1.39.2. I just want to get it on your radar that the latest update may be causing issues for some. I wish I had more information, but I don't at this time.

Expected Result

^

Actual Result

^

@seb-b

seb-b commented Feb 2, 2024

I also had a similar issue with 1.40: it was killing my uWSGI workers, and I had to downgrade to 1.39. I also wish I had more information to give, but there wasn't much to go on. The issue was worse with profiling on (workers would be killed via SIGBUS).

@sentrivana
Contributor

sentrivana commented Feb 2, 2024

@fiendish @seb-b Hey folks, thanks for bringing this to our attention. Even if there is no repro, more info could help point us in the right direction -- what kind of apps (Django, Flask, FastAPI, something else?) are you running? Which SDK integrations are active (initing the SDK with debug=True should log this at startup)?

Looking at the changelog for 1.40.0, there are a couple of things we could try (a combined example follows the list):

  • turning off metrics collection:
    sentry_sdk.init(
        ... # your usual stuff
        _experiments={
            "enable_metrics": False,
        }
    ) 
  • turning off DB query source capture:
    sentry_sdk.init(
        ... # your usual stuff
        enable_db_query_source=False,
    ) 
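
For reference, a combined sentry_sdk.init with both of those turned off might look roughly like this (a sketch; keep your existing DSN, integrations, and other settings as they are):

import sentry_sdk

sentry_sdk.init(
    dsn="...",  # your usual settings go here
    enable_db_query_source=False,
    _experiments={
        "enable_metrics": False,
    },
)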

@fiendish
Author

fiendish commented Feb 2, 2024

Oops, yeah, sorry. In my haste I forgot to include that. In my case it's Flask 2.1.3 + uWSGI 2.0.23, and

import logging

from sentry_sdk.integrations.flask import FlaskIntegration
from sentry_sdk.integrations.logging import LoggingIntegration

integrations = [
    FlaskIntegration(),
    LoggingIntegration(
        level=logging.INFO,
        event_level=logging.ERROR,
    ),
]

I'll try to find time over the weekend or on Monday to try those things.

@gerricom

gerricom commented Feb 2, 2024

Same here with Django 3.2 LTS, uWSGI 2.0.23. These are my settings:

import logging

from sentry_sdk.integrations.django import DjangoIntegration
from sentry_sdk.integrations.logging import LoggingIntegration

sentry_logging = LoggingIntegration(
    level=logging.INFO,  # Capture info and above as breadcrumbs
    event_level=logging.ERROR,  # Send errors as events
)
sentry_settings = {
    "dsn": SENTRY_LIVE_DSN,
    "integrations": [DjangoIntegration(), sentry_logging],
    "send_default_pii": True,
    "enable_tracing": True,
}
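
(A minimal sketch of how such a settings dict is typically passed to the SDK; the actual call site isn't shown in the comment:)

import sentry_sdk

sentry_sdk.init(**sentry_settings)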

@sentrivana
Contributor

Thanks @gerricom -- are you able to try out turning metrics and/or DB query source off as outlined here?

@gerricom

gerricom commented Feb 2, 2024

Yes, I'd like to try this. But I'll have to wait until things are a bit calmer here in the evening (in ~6 hrs or so).

@arvinext

arvinext commented Feb 2, 2024

We hit the same issue too, with a Python 3.11 / Django 4.1.10 app running on uWSGI 2.0.23 and sentry-sdk 1.40.0. We upgraded from sentry-sdk 1.15.0.

We did not change the sentry_sdk.init, which is:

sentry_sdk.init(  # type: ignore[abstract]
    dsn=SENTRY_DSN,
    integrations=[
        sentry_django(),
        sentry_logging(event_level=None),
    ],
    release='myapp',
    environment=settings.APP_ENVIRONMENT,
    attach_stacktrace=True,
    sample_rate=DEFAULT_SAMPLE_RATE,
    shutdown_timeout=DEFAULT_SHUTDOWN_TIMEOUT,
    before_send=alert_utils.before_send,
)

Our uWSGI setup:
In our case, we have the following setting to recycle uWSGI workers every hour:

max-worker-lifetime= 3600 #seconds

After an hour of worker lifetime, we see that the uWSGI master complains that the listen queue is full. Log snippet attached below. The workers never get respawned and we eventually lose the Kubernetes pod.

Fri Feb  2 17:53:20 2024 - worker 1 lifetime reached, it was running for 21 second(s)
Fri Feb  2 17:53:20 2024 - worker 2 lifetime reached, it was running for 21 second(s)
[WARNING] 032/175340 (10426) : Server uwsgi/uwsgi-7001 is DOWN, reason: Layer7 timeout, check duration: 6000ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 032/175342 (10426) : Server uwsgi/uwsgi-7002 is DOWN, reason: Layer7 timeout, check duration: 6000ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 032/175342 (10426) : Server uwsgi/uwsgi-7000 is DOWN, reason: Layer7 timeout, check duration: 6001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 032/175342 (10426) : backend 'uwsgi' has no server available!
Fri Feb  2 17:59:35 2024 - *** uWSGI listen queue of socket "127.0.0.1:7101" (fd: 3) full !!! (32/32) ***
Fri Feb  2 17:59:36 2024 - *** uWSGI listen queue of socket "127.0.0.1:7101" (fd: 3) full !!! (32/32) ***

I was able to reproduce this in our test environment by setting max-worker-lifetime= 20 #seconds and hence making the workers die sooner. I could see the same behavior as in the logs above.

Additionally, I noticed that the worker processes never went away when I ran ps aux, which makes me think that supervisord, which we use to restart the processes, thought the processes never died.

I also confirmed that this issue does not occur when we kill workers via _raise_sigint() based on PSS memory, or when workers get reloaded using the reload-on-rss parameter.

I have only noticed this happening when the max-worker-lifetime value kicks in.
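
(For anyone trying to reproduce this, a minimal command line of the kind described above might look roughly like the following; the module path and socket are placeholders, not the actual app:)

uwsgi --module myapp.wsgi:application --http-socket 127.0.0.1:7101 --master --processes 4 --enable-threads --max-worker-lifetime 20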

@xrmx
Contributor

xrmx commented Feb 12, 2024

@PAStheLoD thanks for spending time on this. While you are at it, does lazy-apps suffice or do you really need lazy?

@sentrivana
Contributor

Thanks @PAStheLoD, this is very helpful. I'd also be interested in whether --lazy-apps has the same effect on your app or if you need --lazy.

My current thoughts on this are that there has to be a way to make this work since the transport worker and profiler work largely the same way and we're not seeing them explode people's apps. One difference (other than the use of threading.Event) that I see in the metrics code vs. other places where we have background threads is that in metrics we start the thread right off the bat, whereas elsewhere the thread is only started on demand. I can imagine this making a difference in prefork mode. I'll align this.

I'm also wondering if upgrading to the recent uWSGI 2.0.24 might make a difference?
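
To illustrate the difference described above (a rough sketch of the pattern, not the SDK's actual code): starting a background flusher thread at init time means the thread is created in the uWSGI master and is lost when the workers are forked, whereas starting it lazily on first use means each worker spawns its own thread after the fork.

import threading


class LazyFlusher:
    # Sketch of an on-demand background thread; the flush/buffer bodies are placeholders.

    def __init__(self):
        self._thread = None
        self._lock = threading.Lock()

    def _ensure_thread_running(self):
        # Start the thread only when first needed, i.e. inside the (forked) worker,
        # instead of at import/init time in the uWSGI master process.
        with self._lock:
            if self._thread is None or not self._thread.is_alive():
                self._thread = threading.Thread(target=self._flush_loop, daemon=True)
                self._thread.start()

    def _flush_loop(self):
        ...  # periodically flush buffered data

    def record(self, key, value):
        self._ensure_thread_running()
        ...  # buffer (key, value) for the next flush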

@Tenzer

Tenzer commented Feb 13, 2024

We were also hit by uWSGI workers getting stuck, presumably due to changes in sentry-sdk, although we did try to revert the upgrades without much luck. In the end we migrated from uWSGI to Unit and it's been rock solid for us. Just in case anybody else also gets frustrated with uWSGI...

@sentrivana
Contributor

sentrivana commented Feb 13, 2024

Everyone, we've just released 1.40.4 with a tentative fix.

If you're affected by this issue, it'd greatly help us out if you could try it out and report back. What you need to do is install sentry-sdk 1.40.4, add the following to your init, and run with your usual uWSGI config.

sentry_sdk.init(
    ... # your usual stuff
    # Metrics collection is still off by default in 1.40.4 if you're under uWSGI,
    # but this can be overridden by
    _experiments={
        "force_enable_metrics": True,
    }
) 

Caveat: The fix in 1.40.4 should work as long as you're not manually emitting any metrics (with sentry_sdk.metrics.<metric_type>(...)) on startup (on demand, e.g. on request, is fine). In that case the only known fix so far is to run uWSGI with --lazy-apps to disable preforking mode. We're looking into addressing this as well without requiring --lazy-apps.

Also, please note that for the SDK not to break in other unexpected ways, uWSGI needs to be run with --enable-threads.
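
To illustrate the caveat above with a rough Flask-flavoured sketch (the app and metric names are made up; assume sentry_sdk.init(...) has been called as usual): the first call runs at import time in the master before forking and forces the metrics thread to start early, while the second runs on demand inside a worker and is fine.

from flask import Flask
from sentry_sdk import metrics

# sentry_sdk.init(...) assumed to have been called already (not shown).
app = Flask(__name__)

# Risky under preforking uWSGI: emitted at import time, in the master process,
# before the workers are forked.
metrics.incr("app.module_imported")


@app.route("/")
def index():
    # Fine: emitted on demand inside a forked worker.
    metrics.incr("requests.index")
    return "ok"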

@natano

natano commented Feb 14, 2024

I can reproduce the issue.

uwsgi invocation with a.py from @PAStheLoD:

uwsgi -w a:flask_app --need-app --http-socket 127.0.0.1:8080 -p 4 --enable-threads

The segfault is caused by uwsgi not handling threads correctly with default arguments.

When forking Python with multiple threads running, the fork hooks PyOS_BeforeFork, PyOS_AfterFork_Child and PyOS_AfterFork_Parent have to be run before/after the fork. A correct fork sequence looks something like this. uwsgi runs none of those by default. It seems like this is by design, as uwsgi assumes that WSGI apps are single-threaded (source).

To make the segfault disappear, uwsgi has to be called with --enable-threads (to initialize the GIL) and --py-call-uwsgi-fork-hooks (to run PyOS_AfterFork_Child in the worker after the fork).

uwsgi -w a:flask_app --need-app --http-socket 127.0.0.1:8080 -p 4 --enable-threads --py-call-uwsgi-fork-hooks

Interestingly, this behaviour is still not fully correct: only PyOS_AfterFork_Child is called; PyOS_BeforeFork and PyOS_AfterFork_Parent are not. I'm not sure if that can lead to segfaults too, but in any case not all hooks registered by os.register_at_fork are called when the worker is forked (the before and after_in_parent hooks are not called).
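
(For anyone who wants to check this themselves, a small diagnostic along these lines can be dropped into the WSGI module; the logging destination is arbitrary. The registered callbacks are the ones driven by PyOS_BeforeFork, PyOS_AfterFork_Parent and PyOS_AfterFork_Child respectively:)

import os
import sys


def _log_fork_hook(stage):
    # Record which fork hook fired and in which process.
    print(f"fork hook {stage!r} fired in pid {os.getpid()}", file=sys.stderr)


os.register_at_fork(
    before=lambda: _log_fork_hook("before"),
    after_in_parent=lambda: _log_fork_hook("after_in_parent"),
    after_in_child=lambda: _log_fork_hook("after_in_child"),
)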

Not sure if there is any viable workaround for the sentry SDK, as the corruption happens outside of its control.

IMHO this is not a sentry issue, but caused by unsafe defaults in uwsgi. I wish uwsgi would pick safer defaults.


Used software versions:
Linux 6.1.71-1-lts (arch linux kernel build)
Python 3.11.8 (arch linux PKGBUILD, but rebuilt with debug symbols)
uwsgi 2.0.23 and 2.0.24
sentry_sdk 1.40.2

@sentrivana
Contributor

sentrivana commented Feb 14, 2024

Thanks a lot for looking into this @natano!

I think the best we can do on the SDK side moving forward is to issue a warning on startup if we detect we're in non-lazy mode and --py-call-uwsgi-fork-hooks is not on, just like we do currently if you're running uWSGI without --enable-threads.

The fix in 1.40.4 should work even without --py-call-uwsgi-fork-hooks in most cases because with that release we spawn all threads on demand BUT if you're unlucky and a thread is spawned at startup before forking, you might run into this again.

So to sum up for future reference, if you run into this issue:

  • run uWSGI with both --enable-threads and --py-call-uwsgi-fork-hooks
  • OR run uWSGI with --lazy-apps (or --lazy, but this is discouraged) and --enable-threads

@thedanfields

To add some extra seasoning to this issue.

We saw failures in our AWS Lambda executors after moving to 1.40.3:

Traceback (most recent call last):
  < ... omitted ... >
  File "/var/task/sentry_sdk/integrations/threading.py", line 56, in sentry_start
    return old_start(self, *a, **kw)
  File "/var/lang/lib/python3.9/threading.py", line 899, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

@sentrivana
Contributor

@thedanfields I'd wager a guess that's an AWS-specific issue - reminds me of #2632 without the upgrading-to-Python-3.12 part. I wonder if we've hit some kind of AWS Lambda limit on the number of threads, since the SDK is now spawning one additional thread.

Could you open a new issue with this please?

@thedanfields

@sentrivana , happy to oblige

I created a new issue as requested.

untitaker added a commit to getsentry/snuba that referenced this issue Feb 16, 2024
it seems getsentry/sentry-python#2699 is
closed, and the outcome is:

* deadlock is fixed
* segfault is caused by improper uwsgi settings, and the SDK should warn
  if those are present

I ran `snuba api` locally and the warning doesn't trigger, so I assume
we're good. But also, I think we may have only encountered the deadlock,
not the segfault.