Possible race condition leading to a connection reset if worker is gracefully terminating #2315
Replies: 7 comments
-
Hi, we are seeing the same(?) problem: when a worker restarts due to max-requests, sometimes a request gets lost. In these cases an RST/ACK can be observed. Setup:
Our app is an API server with async FastAPI endpoints. It receives relatively large requests (say 2-15 KB), and I think the larger requests have a better chance of triggering the race. While trying to repro this, I'm also seeing this error message in the error log now:
-
Oh yeah, it's not very rare for us. With max_requests = 10000 and 4 workers, we hit this every few hours :-)
-
Repro code, asgi_sample.py:

```python
async def app(scope, receive, send):
    headers = [(b"content-type", b"text/html")]
    body = b'<html>hi!<br>'
    await send({"type": "http.response.start", "status": 200, "headers": headers})
    await send({"type": "http.response.body", "body": body})
```

gunicorn invocation: see the sketched example after the curl script below.
curl script (repro_vpt.sh):

```bash
#!/bin/bash
REMOTE='192.168.103.39:9222'
echo 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' | \
curl -XPUT 'http://'$REMOTE'/404/404/404/404/404/404/404' \
-H 'user-agent: python-requests/2.31.0' \
-H 'sentry-trace: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' \
-H 'baggage: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE' \
-H 'x-request-id: FFFFFFFFFFFFFFFFFFFF' \
-H 'accept-encoding: gzip, deflate' \
-H 'accept: */*' \
-H 'content-type: application/json' \
--data-binary @-
echo Running
```

Note that repro_vpt.sh has the IP address of the host running gunicorn in the curl command line. Having the "useless data" in the curl call seems to help with reproducing, but it is not completely necessary.
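For completeness, here is a sketch of a gunicorn invocation matching the setup described in this thread (uvicorn worker class plus a low `--max-requests` so worker restarts happen often). The worker count, max-requests value, and bind address are assumptions for illustration, not the commenter's exact command; only the port is taken from the curl script's REMOTE above:

```bash
# Hypothetical invocation: run asgi_sample.py under gunicorn with the uvicorn
# worker class and recycle the worker every 20 requests to make the race easier to hit.
gunicorn asgi_sample:app \
    --worker-class uvicorn.workers.UvicornWorker \
    --workers 1 \
    --max-requests 20 \
    --bind 0.0.0.0:9222
```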
-
Having the same issue using gunicorn + uvicorn + Django 4. My workers handle a lot of requests, so at peak RPM with --max-requests=8000 (25 containers, 1 worker each, round-robin load balancing) it is noticeable; I even get it randomly sometimes (it shows up in Cloudflare as unknown error code 502). Without --max-requests it works fine, but I'm trying to avoid memory leaks taking down my workers. It doesn't seem related to request duration, headers, or content-length; it's just random. I see this in the logs when max-requests+1 is reached and the issue is triggered.
-
Same issue here.
-
@rbagd what if you use async endpoints instead of sync endpoints/routes?
-
@Kludex can you please have a look at this issue?
-
We have encountered a relatively rare connection error which is probably due to a race condition as the `uvicorn` worker is trying to shut down. Here is the setup:

- `uvicorn` worker
- `gunicorn` with `--max-requests` for regularly restarting workers

I can reproduce it with Python 3.11, with both `uvloop` and `asyncio`, but couldn't reproduce it with `asyncio` and Python 3.12.

To reproduce, I launch the app below, then stress test the application with many concurrent users constantly hitting the API. After some waiting I eventually hit `Connection reset by peer` errors.
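Any load generator that keeps many requests in flight while workers recycle should do; a minimal sketch, assuming the curl script from the earlier comment is saved as `repro_vpt.sh` and that roughly 50 parallel clients are enough (both assumptions):

```bash
# Hypothetical load loop: keep ~50 clients hammering the endpoint so that new
# requests keep arriving while a worker hits --max-requests and restarts.
# curl's stderr is left visible so "Connection reset by peer" failures show up.
for i in $(seq 50); do
    while true; do ./repro_vpt.sh > /dev/null; done &
done
wait
```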
I did some initial investigation into what is happening. Here's a `tcpdump` for one of these errors, which I tried to correlate with events in the code. It always happens around the time when `max-requests` is reached and the worker is shutting down. It seems that in certain cases the worker doesn't shut down gracefully despite data having just arrived in the TCP stack.

After some deep-diving I noticed that every time the error is triggered, `self.cycle` is `None` in `HttpToolsProtocol.shutdown()`, and if I am correct the reverse is true as well. It seems to me that adding a block to `httptools_impl.py` or `h11_impl.py` solves the issue, but I am not really sure what this means.
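For context, the graceful-shutdown path in uvicorn's HTTP protocol classes looks roughly like the sketch below (paraphrased from `httptools_impl.py`, not a verbatim copy). On the reading above, the failure mode would be the first branch: `self.cycle` is `None` because no request cycle has started, so the transport is closed immediately even though the client's next request may already be sitting in the kernel's receive buffer:

```python
# Simplified sketch of the shutdown path in uvicorn's HTTP protocol classes
# (paraphrased from httptools_impl.py; not a verbatim copy of the real code).
class _ProtocolSketch:
    def shutdown(self) -> None:
        if self.cycle is None or self.cycle.response_complete:
            # No request cycle is active (or the last response has finished),
            # so the worker closes the connection immediately.  If the client's
            # next request has already reached the kernel's receive buffer but
            # data_received() has not run yet, this close is what the client
            # observes as "Connection reset by peer".
            self.transport.close()
        else:
            # A request is still in flight: let it complete, then close the
            # connection instead of keeping it alive.
            self.cycle.keep_alive = False
```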