Possible race condition leading to a connection reset if worker is gracefully terminating #2315
Replies: 7 comments
-
Hi, we are seeing the same(?) problem: when a worker restarts due to max-requests, sometimes a request gets lost. In these cases an RST/ACK can be observed. Setup:
Our app is an API server with async FastAPI endpoints. It receives relatively large requests (say 2-15 KB), and I think the larger requests have a better chance of triggering the race. While trying to repro this, I'm also seeing this error message in the error log now:
-
Oh yeah, it's not very rare for us. With max_requests = 10000 and 4 workers, we hit this every few hours :-)
-
Repro code, asgi_sample.py:

```python
async def app(scope, receive, send):
    headers = [(b"content-type", b"text/html")]
    body = b'<html>hi!<br>'
    await send({"type": "http.response.start", "status": 200, "headers": headers})
    await send({"type": "http.response.body", "body": body})
```

gunicorn invocation: see the sketched example after the curl script below.
curl script (repro_vpt.sh):

```bash
#!/bin/bash
REMOTE='192.168.103.39:9222'
echo 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' | \
curl -XPUT 'http://'$REMOTE'/404/404/404/404/404/404/404' \
-H 'user-agent: python-requests/2.31.0' \
-H 'sentry-trace: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' \
-H 'baggage: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE' \
-H 'x-request-id: FFFFFFFFFFFFFFFFFFFF' \
-H 'accept-encoding: gzip, deflate' \
-H 'accept: */*' \
-H 'content-type: application/json' \
--data-binary @-
echo Running
```

Note that repro_vpt.sh has the IP address of the host running gunicorn in the curl command line. Having the "useless data" in the curl call seems to help with reproducing, but it is not completely necessary.
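For completeness, here is a sketch of a gunicorn invocation matching the setup described in this thread (uvicorn worker class plus a low `--max-requests` so worker restarts happen often). The worker count, max-requests value, and bind address are assumptions for illustration, not the commenter's exact command; only the port is taken from the curl script's REMOTE above:

```bash
# Hypothetical invocation: run asgi_sample.py under gunicorn with the uvicorn
# worker class and recycle the worker every 20 requests to make the race easier to hit.
gunicorn asgi_sample:app \
    --worker-class uvicorn.workers.UvicornWorker \
    --workers 1 \
    --max-requests 20 \
    --bind 0.0.0.0:9222
```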
-
Having the same issue using gunicorn + uvicorn + Django 4. My workers handle a lot of requests, so at peak RPM with --max-requests=8000 (25 containers, 1 worker each, round-robin load balancing) it is noticeable; I even get it randomly sometimes (it shows up in Cloudflare as unknown error code 502). Without --max-requests it works fine, but I'm trying to avoid memory leaks taking down my workers. It doesn't seem related to request duration, headers, or content-length; it's just random. I see this in the logs when max-requests+1 is reached and the issue is triggered.
-
Same issue here.
-
@rbagd what if you use async endpoints instead of sync endpoints/routes?
-
@Kludex can you please have a look at this issue?
-
We have encountered a relatively rare connection error which is probably due to a race condition as the `uvicorn` worker is trying to shut down. Here is the setup:

- `uvicorn` worker
- `gunicorn` with `--max-requests` for regularly restarting workers

I can reproduce it with Python 3.11, with both `uvloop` and `asyncio`, but couldn't reproduce it with `asyncio` and Python 3.12.

To reproduce, I launch the app below, then stress test the application with many concurrent users constantly hitting the API. After some waiting I eventually hit `Connection reset by peer` errors.
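Any load generator that keeps many requests in flight while workers recycle should do; a minimal sketch, assuming the curl script from the earlier comment is saved as `repro_vpt.sh` and that roughly 50 parallel clients are enough (both assumptions):

```bash
# Hypothetical load loop: keep ~50 clients hammering the endpoint so that new
# requests keep arriving while a worker hits --max-requests and restarts.
# curl's stderr is left visible so "Connection reset by peer" failures show up.
for i in $(seq 50); do
    while true; do ./repro_vpt.sh > /dev/null; done &
done
wait
```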
I did some initial investigation into what is happening. Here's a `tcpdump` for one of these errors, which I tried to correlate with events in the code. It always happens around the time when `max-requests` is reached and the worker is shutting down. It seems that in certain cases the worker doesn't shut down gracefully despite data having just arrived in the TCP stack.

After some deep-diving I noticed that every time the error is triggered, `self.cycle` is `None` in `HttpToolsProtocol.shutdown()`, and if I am correct the reverse is true as well. It seems to me that adding a block to `httptools_impl.py` or `h11_impl.py` solves the issue, but I am not really sure what this means.
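For context, the graceful-shutdown path in uvicorn's HTTP protocol classes looks roughly like the sketch below (paraphrased from `httptools_impl.py`, not a verbatim copy). On the reading above, the failure mode would be the first branch: `self.cycle` is `None` because no request cycle has started, so the transport is closed immediately even though the client's next request may already be sitting in the kernel's receive buffer:

```python
# Simplified sketch of the shutdown path in uvicorn's HTTP protocol classes
# (paraphrased from httptools_impl.py; not a verbatim copy of the real code).
class _ProtocolSketch:
    def shutdown(self) -> None:
        if self.cycle is None or self.cycle.response_complete:
            # No request cycle is active (or the last response has finished),
            # so the worker closes the connection immediately.  If the client's
            # next request has already reached the kernel's receive buffer but
            # data_received() has not run yet, this close is what the client
            # observes as "Connection reset by peer".
            self.transport.close()
        else:
            # A request is still in flight: let it complete, then close the
            # connection instead of keeping it alive.
            self.cycle.keep_alive = False
```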