[Frontend] don't block event loop in tokenization (preprocess) in OpenAI compatible server #10635

tomeras91 · 2024-11-25T12:48:46Z

Currently, tokenization of requests in the OpenAI compatible server is done synchronously on the process running the server. Usually, this is not a problem, but when dealing with long sequence lengths (256K), tokenization can take a few hundred milliseconds. Combining long requests with high loads causes the server to be non-responsive as it spends a lot of time tokenizing requests synchronously and sequentially, blocking the asyncio event loop.

This PR aims to unblock the event loop by using a thread pool to tokenize requests. This following #3512 which also introduced parallel async tokenization but was closed and this feature was never delivered.

github-actions · 2024-11-25T12:48:59Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

vllm/commit_id.py

robertgshaw2-neuralmagic · 2024-11-25T12:55:13Z

LGTM - as a sanity check, can you run a quick ShareGPT benchmark on a small model? If you need instructions for this, I can share commands.

vllm/entrypoints/openai/serving_engine.py

…ver by using threadpool for tokenization Signed-off-by: Tomer Asida <[email protected]>

Signed-off-by: Tomer Asida <[email protected]>

tomeras91 · 2024-11-25T16:22:04Z

LGTM - as a sanity check, can you run a quick ShareGPT benchmark on a small model? If you need instructions for this, I can share commands.

Sure. Here are the results

Model: Qwen/Qwen2.5-1.5B-Instruct
Hardware: single H100

branch	throughput (requests/s)	mean TTFT (msec)
main (commit `ed46f14`)	51.33	6645.88
pr (commit `980fff8`)	51.62	6251.77

So results are pretty much identical. Maybe a bit better with the PR but I guess the diffs are too small to be meaningful.

Other than that, I think the CI failures are not related to the changes in this PR

njhill

Thanks @tomeras91.

Another thought is that it's likely optimal to only dispatch to another thread if the length of text / number of tokens is above some threshold, otherwise it may be slightly detrimental.

But like @robertgshaw2-neuralmagic said, we should make sure to benchmark this anyhow.

njhill · 2024-11-25T16:17:31Z

vllm/entrypoints/openai/serving_engine.py

+        self._tokenize_prompt_inputs_async = make_async(
+            self._tokenize_prompt_inputs)
+        self._tokenize_prompt_input_or_inputs_async = make_async(
+            self._tokenize_prompt_input_or_inputs)


I don't think these will work as intended. The methods return a generator which will be done asynchronously but the actual work done to generate the outputs will still be done on the asyncio event loop while iterating.

We'll probably need to think of another way to arrange this, possibly we can change these methods to just return lists rather than generators.

that's a good catch. It really didn't work for /v1/completions which uses these methods. It did work for /v1/chat/completions that uses _tokenize_prompt_input which doesn't return a generator and actually does the tokenization work.
Anyway, fixed by making _tokenize_prompt_input_or_inputs return a List as you suggested.

Will post updated benchmarks and something that shows this works shortly

njhill · 2024-11-25T16:25:54Z

@tomeras91 presumably you'll also test with your >200k token workload to ensure that it helps?

…in self._tokenize_prompt_input Signed-off-by: Tomer Asida <[email protected]>

…c will actually make execution run in thread and not just generator creation Signed-off-by: Tomer Asida <[email protected]>

tomeras91 · 2024-11-25T19:41:57Z

Updated benchmarks

Now both on /v1/completions and on /v1/chat/completions

Model: Qwen/Qwen2.5-1.5B-Instruct
Hardware: single H100
Serve run command: vllm serve Qwen/Qwen2.5-1.5B-Instruct

Endpoint: v1/completions
(benchmark run command: python3 benchmark_serving.py --model $MODEL --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json

branch	throughput (requests/s)	mean TTFT (msec)	mean TPOT (msec)	Total generated tokens
main (commit `ed46f14`)	52.88	6406.30	23.09	189265
pr (commit `b61a04f`)	52.92	5985.69	21.29	191136

Conclusion: Pretty much the same (very slightly better performance in PR even though more tokens were generated)

Endpoint: v1/chat/completions
(benchmark run command: python3 benchmark_serving.py --model $MODEL --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --endpoint /v1/chat/completions --backend openai-chat

branch	throughput (requests/s)	mean TTFT (msec)	mean TPOT (msec)	Total generated tokens
main (commit `ed46f14`)	57.63	5946.73	23.73	171968
pr (commit `b61a04f`)	55.52	5972.38	23.50	172531

Conclusion: Pretty much the same (slightly worse performance in PR, maybe due to more generated tokens)

Anyway, I would expect this PR to affect TTFT much more than it affects TPOT, but it seems like a very insignificant change regardless. If at all, the effect is more significant for the completions endpoint, and is an improvement.

RE something to show this actually solves the problem:

Here is the code I used:

import requests
import threading
from threading import Thread
import time

def send_request(url, body):
    # print(f"Sending from thread {threading.get_ident()}")
    start = time.time()
    res = requests.post(url, json=body)
    end = time.time()
    # print(f"got res in thread {threading.get_ident()} after {1000*(end-start):.2f} ms with status {res.status_code}. content: {res.json() if res.status_code!=200 else 'success'}")
    return res

model = "Qwen/Qwen2.5-1.5B-Instruct"

chat_url = "http://localhost:8000/v1/chat/completions"
chat_body = {'model': model,
             'messages': [{'role': 'user', 'content': ' '.join(['A']*300_000)}],
            }

completions_url = "http://localhost:8000/v1/completions"
completions_body = {'model': model,
                    'prompt': ' '.join(['A']*300_000),
                   }
completions_body_mult = {'model': model,
                         'prompt': [' '.join(['A']*300_000)]*2,
                        }

for _ in range(60):
    # Thread(target=send_request, kwargs={'url': completions_url, 'body': completions_body}).start()
    Thread(target=send_request, kwargs={'url': completions_url, 'body': completions_body_mult}).start()
    # Thread(target=send_request, kwargs={'url': chat_url, 'body': chat_body}).start()
start = time.time()
res = requests.get("http://localhost:8000/health")
end = time.time()
print(f"######################### healthcheck took {end-start} secs #########################")

The idea is to send many long context requests in close succession (using multiple threads), and then check how long /health takes to respond. If the event loop is blocked, we would expect long response time of the healthcheck. As you can see, I checked it for both v1/completions and v1/chat/completions, and also for multiple inputs for the completions endpoint. Note that these requests are longer than the supported max model length (32K) so no work is actually done by the model and the requests are blocked after tokenization.

The response time of /health when the server is not dealing with any requests was about 2.3 msecs

branch	endpoint	health response time (msec)
main (commit `ed46f14`)	`v1/completions`	17848.1
main (commit `ed46f14`)	`v1/completions` multiple inputs	14285.4
main (commit `ed46f14`)	`v1/chat/completions`	17159.3
pr (commit `b61a04f`)	`v1/completions`	34.6
pr (commit `b61a04f`)	`v1/completions` multiple inputs	41.0
pr (commit `b61a04f`)	`v1/chat/completions`	16.4

The response time is much shorter in the PR. We do see that the healthcheck takes a bit longer to respond under this long context requests load, which can be attributed to the fact that tokenizing many such long requests does take a bit CPU

njhill · 2024-11-26T00:31:29Z

Thanks @tomeras91, it looks great now. One minor concern I realized is that the HF tokenizers technically aren't thread-safe. I think in practice this is only the case if things like truncation and/or padding are being done, but that can be true for us (if truncate parameter is specified in the request).

Despite this it might still not be a problem given that I don't think the GIL is really released. But I'm not sure whether we should put an explicit lock around the calls to the tokenizer or create a tokenizer per thread. The former would likely be less invasive and could be done by wrapping/replacing the __call__ method when the tokenizer is created. It would probably be best to re-run the perf test after doing that.

Signed-off-by: Tomer Asida <[email protected]>

tomeras91 · 2024-11-26T16:20:38Z

Thanks @tomeras91, it looks great now. One minor concern I realized is that the HF tokenizers technically aren't thread-safe. I think in practice this is only the case if things like truncation and/or padding are being done, but that can be true for us (if truncate parameter is specified in the request).

Despite this it might still not be a problem given that I don't think the GIL is really released. But I'm not sure whether we should put an explicit lock around the calls to the tokenizer or create a tokenizer per thread. The former would likely be less invasive and could be done by wrapping/replacing the __call__ method when the tokenizer is created. It would probably be best to re-run the perf test after doing that.

@njhill - That's another great catch. Without putting an explicit lock (commit b61a04f), sending multiple requests in parallel, some with truncate_prompt_tokens and some without causes error code 500 failures. I guess this is some sort of similar issue to huggingface/tokenizers#537, dealing with the tokenizer's mutable truncation state, exactly as you pointed out. main (commit ed46f14) doesn't fail in that situation so that's indeed a regression.

Fix - explicit lock

As you suggested, I fixed this by wrapping the tokenizer's __call__ with an explicit threading lock. This required a bit more changes in the code - I'm not sure the choices I made are the best and would like some feedback - commit e4cb992. Basically what I did is to add another flag to get_tokenizer called use_threadsafe and to propagate it all the way to init_tokenizer_from_configs. When this flag is set to True, the tokenizer is wrapped as a ThreadsafeTokenizer, similar to the existing CachedTokenizer. The tokenizer used for encoding requests in the OpenAI compatible server is the one held in the MQLLMEngineClient, so now this client initializes a threadsafe tokenizer. I again sent multiple requests, some with truncation and some without, and validated that I get no 500 errors (I used similar code what I posted in #10635 (comment), but edited the request body in each thread to include 'truncate_prompt_tokens': 1000 if random.random()>0.5 else None).

Benchmarks

I ran the same sort of benchmarks as in #10635 (comment)

Model: Qwen/Qwen2.5-1.5B-Instruct
Hardware: single H100
Serve run command: vllm serve Qwen/Qwen2.5-1.5B-Instruct

Endpoint: v1/completions
(benchmark run command: python3 benchmark_serving.py --model $MODEL --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json

branch	throughput (requests/s)	mean TTFT (msec)	mean TPOT (msec)	Total generated tokens
main (commit `ed46f14`)	52.88	6406.30	23.09	189265
pr (commit `b61a04f`)	52.92	5985.69	21.29	191136
pr threadsafe (commit `e59cc81`)	52.75	6090.83	24.06	190028

Endpoint: v1/chat/completions
(benchmark run command: python3 benchmark_serving.py --model $MODEL --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --endpoint /v1/chat/completions --backend openai-chat

branch	throughput (requests/s)	mean TTFT (msec)	mean TPOT (msec)	Total generated tokens
main (commit `ed46f14`)	57.63	5946.73	23.73	171968
pr (commit `b61a04f`)	55.52	5972.38	23.50	172531
pr threadsafe (commit `e59cc81`)	55.83	5782.09	25.14	171961

Conclusion: I think the differences are still within the randomness of the benchmarking script, although we do see a slight increase in TPOT.

/health response times:
The response time of /health when the server is not dealing with any requests was about 2.3 msecs

branch	endpoint	health response time (msec)
main (commit `ed46f14`)	`v1/completions`	17848.1
main (commit `ed46f14`)	`v1/completions` multiple inputs	14285.4
main (commit `ed46f14`)	`v1/chat/completions`	17159.3
pr (commit `b61a04f`)	`v1/completions`	34.6
pr (commit `b61a04f`)	`v1/completions` multiple inputs	41.0
pr (commit `b61a04f`)	`v1/chat/completions`	16.4
pr threadsafe (commit `e59cc81`)	`v1/completions`	2.7
pr threadsafe (commit `e59cc81`)	`v1/completions` multiple inputs	3.0
pr threadsafe (commit `e59cc81`)	`v1/chat/completions`	4.8

So even better than before in the buggy non-threadsafe version

njhill · 2024-11-26T18:19:19Z

Thanks @tomeras91 this looks great. Unfortunately it looks like there's one more problem (seen via failing CI test)...

The tokenizer gets included in logitsprocessors used for guided decoding. We currently initialize these in the front-end process and they are pickled and sent to the back-end .. which is a bit crazy but needed to avoid their creation blocking the GIL in the main proc (unpickling them is faster). There are plans to change this but in the meantime this change breaks things because the lock can't be picked.

Ideas to fix this:

Use (potentially separately constructed) non-threadsafe tokenizer for these logits processors
Implement this in your threadsafe tokenizer class https://docs.python.org/3/library/pickle.html#handling-stateful-objects
Another idea I thought of which might be simpler overall - use a custom ThreadPoolExecutor with max_workers=1 for the make_async calls. This would avoid needing the lock provided that all the tokenize calls for the tokenizer are done async

I think I like (2) the least ... just makes this workaround more complex/convoluted.

tomeras91 · 2024-11-26T18:38:15Z

Yeah I saw the failing CI but didn't get to fixing it yet.
I thought of fixing it by trying to extract the underlying non-threadsafe tokenizer before sending it to outlines. I saw there's already a "adapt tokenizer" function so I can add a bit more logic there.

I think I'll try your idea (3) as well. It makes a lot of sense since, because of the required lock, there's no real gain in running the tokenizer in multiple threads and it will remove the need for the lock.

I also want to lock some of these behaviors (short healthcheck response time under heavy long context load, no failing with mix of truncated and not truncated requests) in unittests. Still don't know how hard will that be but given there are many tests that use an OpenAI server I think it shouldn't be too hard

tomeras91 · 2024-11-26T18:53:10Z

yep. Making the tokenization async with a ThreadPoolExecutor(max_workers=1) works perfectly 🙂
I'll update the PR and add new benchmarks and hopefully some unittests

… No need for threadsafe tokenizer anymore since all tokenization happens on the same thread Signed-off-by: Tomer Asida <[email protected]>

…an be sent concurrently and (2) that /health response time is short under high tokenization load Signed-off-by: Tomer Asida <[email protected]>

Signed-off-by: Tomer Asida <[email protected]>

tomeras91

@njhill - Now we use a threadpool with max_workers=1.

Also added tests. I put them in a new file because I couldn't find a better place where they belong. If you want me to move them elsewhere, that's not a problem of course

tomeras91 · 2024-11-27T12:42:55Z

tests/entrypoints/openai/test_async_tokenization.py

+        }),
+    ],
+)
+async def test_with_and_without_truncate(


This fails if we use a threadpool with max_workers>1 for tokenization because the tokenizer is not threadsafe. It passes both on main and on the PR

tomeras91 · 2024-11-27T12:43:30Z

tests/entrypoints/openai/test_async_tokenization.py

+        }),
+    ],
+)
+async def test_healthcheck_response_time(


This fails on main and passes on the PR. It validates the fix actually works and the server event loop isn't blocked on tokenization

The embedding endpoint is not tested since it requires launching the server again with --task embedding and I wanted to save some time. I think that test will be redundant anyway, since the embedding endpoint uses the same tokenization code paths as completions and chat

tomeras91 · 2024-11-27T12:52:38Z

Updated (hopefully last) Benchmarks

I ran the same sort of benchmarks as in #10635 (comment)

Model: Qwen/Qwen2.5-1.5B-Instruct
Hardware: single H100
Serve run command: vllm serve Qwen/Qwen2.5-1.5B-Instruct

Endpoint: v1/completions
(benchmark run command: python3 benchmark_serving.py --model $MODEL --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json

branch	throughput (requests/s)	mean TTFT (msec)	mean TPOT (msec)	Total generated tokens
main (commit `ed46f14`)	52.88	6406.30	23.09	189265
pr (commit `b61a04f`)	52.92	5985.69	21.29	191136
pr threadsafe (commit `e59cc81`)	52.75	6090.83	24.06	190028
pr single thread threadpool (commit `ff1d6a9`)	52.66	6092.80	23.57	190356

Endpoint: v1/chat/completions
(benchmark run command: python3 benchmark_serving.py --model $MODEL --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --endpoint /v1/chat/completions --backend openai-chat

branch	throughput (requests/s)	mean TTFT (msec)	mean TPOT (msec)	Total generated tokens
main (commit `ed46f14`)	57.63	5946.73	23.73	171968
pr (commit `b61a04f`)	55.52	5972.38	23.50	172531
pr threadsafe (commit `e59cc81`)	55.83	5782.09	25.14	171961
pr single thread threadpool (commit `ff1d6a9`)	55.68	5851.25	24.94	171941

same conclusions - there's no hit to performance

/health response times are <~5 ms. I also added a unittest to make sure they response time under high tokenization load are not too different than under no load.

njhill

Thanks @tomeras91!

mergify bot added the frontend label Nov 25, 2024

tomeras91 changed the title ~~[Frontend] don't block GIL in tokenization (preprocess) in OpenAI compatible server~~ [Frontend] don't block event loop in tokenization (preprocess) in OpenAI compatible server Nov 25, 2024

robertgshaw2-neuralmagic reviewed Nov 25, 2024

View reviewed changes

vllm/commit_id.py Outdated Show resolved Hide resolved

DarkLight1337 reviewed Nov 25, 2024

View reviewed changes

vllm/entrypoints/openai/serving_engine.py Outdated Show resolved Hide resolved

tomeras91 added 4 commits November 25, 2024 15:07

don't block GIL in tokenization (preprocess) in OpenAI compatible ser…

6af8e61

…ver by using threadpool for tokenization Signed-off-by: Tomer Asida <[email protected]>

format

821665b

Signed-off-by: Tomer Asida <[email protected]>

remove commit_id that was mistakenly added

5f2164a

Signed-off-by: Tomer Asida <[email protected]>

simpler - just assign methods in init

dd01b53

Signed-off-by: Tomer Asida <[email protected]>

tomeras91 force-pushed the async-tokenization-in-oai-server branch from 5162f66 to dd01b53 Compare November 25, 2024 13:07

tomeras91 added 3 commits November 25, 2024 15:08

format

4a6efcb

Signed-off-by: Tomer Asida <[email protected]>

async tokenization also in serving_score.py

f89eaa0

Signed-off-by: Tomer Asida <[email protected]>

format

980fff8

Signed-off-by: Tomer Asida <[email protected]>

njhill reviewed Nov 25, 2024

View reviewed changes

tomeras91 added 2 commits November 25, 2024 20:37

no need to make self._tokenize_prompt_inputs async as it's used only …

da646c1

…in self._tokenize_prompt_input Signed-off-by: Tomer Asida <[email protected]>

make self._tokenize_prompt_input_or_inputs return a list so make_asyn…

b61a04f

…c will actually make execution run in thread and not just generator creation Signed-off-by: Tomer Asida <[email protected]>

tomeras91 added 2 commits November 26, 2024 16:16

introduce threadsafe tokenizer and use in MQLLMEngineClient

e4cb992

Signed-off-by: Tomer Asida <[email protected]>

format

e59cc81

Signed-off-by: Tomer Asida <[email protected]>

Use ThreadPoolExecutor with max_workers=1 to make tokenization async.…

f0c0a2f

… No need for threadsafe tokenizer anymore since all tokenization happens on the same thread Signed-off-by: Tomer Asida <[email protected]>

tomeras91 requested a review from simon-mo as a code owner November 27, 2024 11:50

Add tests to validate that (1) truncated and non-truncated requests c…

b35a063

…an be sent concurrently and (2) that /health response time is short under high tokenization load Signed-off-by: Tomer Asida <[email protected]>

tomeras91 force-pushed the async-tokenization-in-oai-server branch from 806d0ee to b35a063 Compare November 27, 2024 12:00

add comment

ff1d6a9

Signed-off-by: Tomer Asida <[email protected]>

tomeras91 commented Nov 27, 2024

View reviewed changes

njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 27, 2024

njhill approved these changes Nov 27, 2024

View reviewed changes

njhill merged commit 395b1c7 into vllm-project:main Nov 27, 2024
57 checks passed

tomeras91 deleted the async-tokenization-in-oai-server branch November 28, 2024 15:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Frontend] don't block event loop in tokenization (preprocess) in OpenAI compatible server #10635

[Frontend] don't block event loop in tokenization (preprocess) in OpenAI compatible server #10635

tomeras91 commented Nov 25, 2024 •

edited by github-actions bot

Loading

github-actions bot commented Nov 25, 2024

robertgshaw2-neuralmagic commented Nov 25, 2024 •

edited

Loading

tomeras91 commented Nov 25, 2024

njhill left a comment

njhill Nov 25, 2024

tomeras91 Nov 25, 2024

njhill commented Nov 25, 2024

tomeras91 commented Nov 25, 2024 •

edited

Loading

njhill commented Nov 26, 2024

tomeras91 commented Nov 26, 2024 •

edited

Loading

njhill commented Nov 26, 2024

tomeras91 commented Nov 26, 2024 •

edited

Loading

tomeras91 commented Nov 26, 2024

tomeras91 left a comment •

edited

Loading

tomeras91 Nov 27, 2024

tomeras91 Nov 27, 2024

tomeras91 Nov 27, 2024

tomeras91 commented Nov 27, 2024

njhill left a comment

[Frontend] don't block event loop in tokenization (preprocess) in OpenAI compatible server #10635

[Frontend] don't block event loop in tokenization (preprocess) in OpenAI compatible server #10635

Conversation

tomeras91 commented Nov 25, 2024 • edited by github-actions bot Loading

github-actions bot commented Nov 25, 2024

robertgshaw2-neuralmagic commented Nov 25, 2024 • edited Loading

tomeras91 commented Nov 25, 2024

njhill left a comment

Choose a reason for hiding this comment

njhill Nov 25, 2024

Choose a reason for hiding this comment

tomeras91 Nov 25, 2024

Choose a reason for hiding this comment

njhill commented Nov 25, 2024

tomeras91 commented Nov 25, 2024 • edited Loading

Updated benchmarks

RE something to show this actually solves the problem:

njhill commented Nov 26, 2024

tomeras91 commented Nov 26, 2024 • edited Loading

Fix - explicit lock

Benchmarks

njhill commented Nov 26, 2024

tomeras91 commented Nov 26, 2024 • edited Loading

tomeras91 commented Nov 26, 2024

tomeras91 left a comment • edited Loading

Choose a reason for hiding this comment

tomeras91 Nov 27, 2024

Choose a reason for hiding this comment

tomeras91 Nov 27, 2024

Choose a reason for hiding this comment

tomeras91 Nov 27, 2024

Choose a reason for hiding this comment

tomeras91 commented Nov 27, 2024

Updated (hopefully last) Benchmarks

njhill left a comment

Choose a reason for hiding this comment

tomeras91 commented Nov 25, 2024 •

edited by github-actions bot

Loading

robertgshaw2-neuralmagic commented Nov 25, 2024 •

edited

Loading

tomeras91 commented Nov 25, 2024 •

edited

Loading

tomeras91 commented Nov 26, 2024 •

edited

Loading

tomeras91 commented Nov 26, 2024 •

edited

Loading

tomeras91 left a comment •

edited

Loading