[Bug] detokenize_incrementally: OverflowError: out of range integral type conversion attempted #1739

josephrocca opened this issue Jun 7, 2024 · 5 comments

josephrocca commented Jun 7, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.

Describe the bug

Most API requests succeed, but the error in the title occurs intermittently. I haven't worked out how to reproduce it reliably yet. I'm not doing anything special: just loading a Llama 2 70B AWQ model that I created with lmdeploy lite auto_awq on a dual 4090 machine.

The same model works fine in vLLM.

Reproduction

It seems to occur randomly when many concurrent requests are sent. I will update this issue if I find a way to reproduce it consistently. Here are the server arguments:

lmdeploy serve api_server MODEL_NAME --tp 2 --session-len 4096 --model-format awq --quant-policy 4 --model-name MODEL_NAME --enable-prefix-caching

I'm using the /v1/completions endpoint (not the chat completion endpoint). The tokenizer class is LlamaTokenizer.
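
Roughly, the client-side load pattern looks like the sketch below (a simplified illustration rather than my exact client; the host/port, prompt, and concurrency level are placeholders):

# Simplified sketch of the load pattern; host/port, prompt and concurrency are placeholders.
import asyncio
import aiohttp

async def one_request(session):
    payload = {
        "model": "MODEL_NAME",
        "prompt": "Write a short story about a robot.",
        "max_tokens": 512,
        "stream": True,
    }
    # Stream from /v1/completions and drain the chunks as they arrive.
    async with session.post("http://localhost:23333/v1/completions", json=payload) as resp:
        async for _chunk in resp.content:
            pass

async def main(concurrency=64):
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(one_request(session) for _ in range(concurrency)))

asyncio.run(main())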

I'm hoping that the error log below sparks an idea about what the cause could be 🤞

Environment

Latest official Docker image: openmmlab/lmdeploy:v0.4.2

Error traceback

2024-06-07T23:16:22.614538430Z ERROR:    Exception in ASGI application
2024-06-07T23:16:22.614568648Z Traceback (most recent call last):
2024-06-07T23:16:22.614572556Z   File "/opt/py38/lib/python3.8/site-packages/starlette/responses.py", line 265, in __call__
2024-06-07T23:16:22.614576223Z     await wrap(partial(self.listen_for_disconnect, receive))
2024-06-07T23:16:22.614579830Z   File "/opt/py38/lib/python3.8/site-packages/starlette/responses.py", line 261, in wrap
2024-06-07T23:16:22.614584258Z     await func()
2024-06-07T23:16:22.614587845Z   File "/opt/py38/lib/python3.8/site-packages/starlette/responses.py", line 238, in listen_for_disconnect
2024-06-07T23:16:22.614591793Z     message = await receive()
2024-06-07T23:16:22.614595310Z   File "/opt/py38/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 568, in receive
2024-06-07T23:16:22.614598726Z     await self.message_event.wait()
2024-06-07T23:16:22.614602193Z   File "/usr/lib/python3.8/asyncio/locks.py", line 309, in wait
2024-06-07T23:16:22.614605649Z     await fut
2024-06-07T23:16:22.614609046Z asyncio.exceptions.CancelledError
2024-06-07T23:16:22.614612433Z 
2024-06-07T23:16:22.614615789Z During handling of the above exception, another exception occurred:
2024-06-07T23:16:22.614619306Z 
2024-06-07T23:16:22.614622622Z   + Exception Group Traceback (most recent call last):
2024-06-07T23:16:22.614626079Z   |   File "/opt/py38/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
2024-06-07T23:16:22.614629485Z   |     result = await app(  # type: ignore[func-returns-value]
2024-06-07T23:16:22.614632882Z   |   File "/opt/py38/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
2024-06-07T23:16:22.614636258Z   |     return await self.app(scope, receive, send)
2024-06-07T23:16:22.614639645Z   |   File "/opt/py38/lib/python3.8/site-packages/fastapi/applications.py", line 1054, in __call__
2024-06-07T23:16:22.614643032Z   |     await super().__call__(scope, receive, send)
2024-06-07T23:16:22.614646458Z   |   File "/opt/py38/lib/python3.8/site-packages/starlette/applications.py", line 123, in __call__
2024-06-07T23:16:22.614649825Z   |     await self.middleware_stack(scope, receive, send)
2024-06-07T23:16:22.614653191Z   |   File "/opt/py38/lib/python3.8/site-packages/starlette/middleware/errors.py", line 186, in __call__
2024-06-07T23:16:22.614656588Z   |     raise exc
2024-06-07T23:16:22.614660014Z   |   File "/opt/py38/lib/python3.8/site-packages/starlette/middleware/errors.py", line 164, in __call__
2024-06-07T23:16:22.614663391Z   |     await self.app(scope, receive, _send)
2024-06-07T23:16:22.614666737Z   |   File "/opt/py38/lib/python3.8/site-packages/starlette/middleware/cors.py", line 93, in __call__
2024-06-07T23:16:22.614670124Z   |     await self.simple_response(scope, receive, send, request_headers=headers)
2024-06-07T23:16:22.614674382Z   |   File "/opt/py38/lib/python3.8/site-packages/starlette/middleware/cors.py", line 148, in simple_response
2024-06-07T23:16:22.614677849Z   |     await self.app(scope, receive, send)
2024-06-07T23:16:22.614681285Z   |   File "/opt/py38/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
2024-06-07T23:16:22.614684702Z   |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
2024-06-07T23:16:22.614688088Z   |   File "/opt/py38/lib/python3.8/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
2024-06-07T23:16:22.614702406Z   |     raise exc
2024-06-07T23:16:22.614706003Z   |   File "/opt/py38/lib/python3.8/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
2024-06-07T23:16:22.614709389Z   |     await app(scope, receive, sender)
2024-06-07T23:16:22.614712786Z   |   File "/opt/py38/lib/python3.8/site-packages/starlette/routing.py", line 756, in __call__
2024-06-07T23:16:22.614716153Z   |     await self.middleware_stack(scope, receive, send)
2024-06-07T23:16:22.614719489Z   |   File "/opt/py38/lib/python3.8/site-packages/starlette/routing.py", line 776, in app
2024-06-07T23:16:22.614722825Z   |     await route.handle(scope, receive, send)
2024-06-07T23:16:22.614726182Z   |   File "/opt/py38/lib/python3.8/site-packages/starlette/routing.py", line 297, in handle
2024-06-07T23:16:22.614729528Z   |     await self.app(scope, receive, send)
2024-06-07T23:16:22.614732885Z   |   File "/opt/py38/lib/python3.8/site-packages/starlette/routing.py", line 77, in app
2024-06-07T23:16:22.614736241Z   |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
2024-06-07T23:16:22.614739628Z   |   File "/opt/py38/lib/python3.8/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
2024-06-07T23:16:22.614743084Z   |     raise exc
2024-06-07T23:16:22.614746571Z   |   File "/opt/py38/lib/python3.8/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
2024-06-07T23:16:22.614749988Z   |     await app(scope, receive, sender)
2024-06-07T23:16:22.614753324Z   |   File "/opt/py38/lib/python3.8/site-packages/starlette/routing.py", line 75, in app
2024-06-07T23:16:22.614756671Z   |     await response(scope, receive, send)
2024-06-07T23:16:22.614760017Z   |   File "/opt/py38/lib/python3.8/site-packages/starlette/responses.py", line 265, in __call__
2024-06-07T23:16:22.614763454Z   |     await wrap(partial(self.listen_for_disconnect, receive))
2024-06-07T23:16:22.614766870Z   |   File "/opt/py38/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
2024-06-07T23:16:22.614770227Z   |     raise BaseExceptionGroup(
2024-06-07T23:16:22.614773613Z   | exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
2024-06-07T23:16:22.614776980Z   +-+---------------- 1 ----------------
2024-06-07T23:16:22.614780346Z     | Traceback (most recent call last):
2024-06-07T23:16:22.614784043Z     |   File "/opt/py38/lib/python3.8/site-packages/starlette/responses.py", line 261, in wrap
2024-06-07T23:16:22.614787440Z     |     await func()
2024-06-07T23:16:22.614790827Z     |   File "/opt/py38/lib/python3.8/site-packages/starlette/responses.py", line 250, in stream_response
2024-06-07T23:16:22.614794253Z     |     async for chunk in self.body_iterator:
2024-06-07T23:16:22.614797570Z     |   File "/opt/lmdeploy/lmdeploy/serve/openai/api_server.py", line 778, in completion_stream_generator
2024-06-07T23:16:22.614800916Z     |     async for res in generator:
2024-06-07T23:16:22.614804272Z     |   File "/opt/lmdeploy/lmdeploy/serve/async_engine.py", line 633, in generate
2024-06-07T23:16:22.614807659Z     |     response, state = self.tokenizer.detokenize_incrementally(
2024-06-07T23:16:22.614810985Z     |   File "/opt/lmdeploy/lmdeploy/tokenizer.py", line 569, in detokenize_incrementally
2024-06-07T23:16:22.614814502Z     |     return self.model.detokenize_incrementally(
2024-06-07T23:16:22.614817879Z     |   File "/opt/lmdeploy/lmdeploy/tokenizer.py", line 425, in detokenize_incrementally
2024-06-07T23:16:22.614821265Z     |     new_tokens = tokenizer.convert_ids_to_tokens(
2024-06-07T23:16:22.614824642Z     |   File "/opt/py38/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 399, in convert_ids_to_tokens
2024-06-07T23:16:22.614828058Z     |     tokens.append(self._tokenizer.id_to_token(index))
2024-06-07T23:16:22.614831415Z     | OverflowError: out of range integral type conversion attempted
2024-06-07T23:16:22.614834771Z     +------------------------------------


# Then after sending another request:

2024-06-07T23:17:17.695793520Z terminate called after throwing an instance of 'std::runtime_error'
2024-06-07T23:17:17.695805133Z   what():  [TM][ERROR] CUDA runtime error: an illegal memory access was encountered /opt/lmdeploy/src/turbomind/utils/allocator.h:231 
2024-06-07T23:17:17.695807778Z 
2024-06-07T23:17:17.695809922Z terminate called recursively
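
For what it's worth, the last Python frame suggests an out-of-range (most likely negative) token id reaching the fast tokenizer, whose underlying id_to_token binding only accepts unsigned 32-bit ids. The snippet below only reproduces the same OverflowError at the tokenizer level in isolation; the -1 id is my assumption about what the engine might be emitting, not something I observed:

# Standalone reproduction of the tokenizer-level error only, not of the engine bug.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MODEL_NAME", use_fast=True)

# An id outside the u32 range (e.g. a -1 sentinel) is rejected by the
# Rust-backed fast tokenizer during the integral conversion:
tokenizer.convert_ids_to_tokens([-1])
# OverflowError: out of range integral type conversion attempted
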
zhyncs (Collaborator) commented Jun 8, 2024

Please provide the client code that can be used for reproduction. Thanks.

josephrocca (Author) commented:

I will do my best to get a reliable reproduction of this detokenize issue. It's possibly not related, since there don't seem to be any tokenizer issues in the logs, but it may be worth referencing since it has the same "an illegal memory access" message:

medwang1 commented Jun 12, 2024

@zhyncs It can be reproduced like this:

wrk -t10 -c100 -d30s -s 01_post.lua --latency http://0.0.0.0:8081/v1/chat/completions

01_post.lua file:

wrk.method = "POST"
wrk.body = [[
	{
		"model": "yi",
		"temperature": 0.7,
		"messages": [
			{
				"role": "user",
				"content": "worker_rlimit_nofile 是一个在 Nginx 或其他基于 Unix-like 系统的 Web 服务器配置中的指令,用于设置工作进程可以打开的最大文件描述符数。这个设置对于服务器性能有重要影响,因为它决定了服务器可以同时处理多少个并发连接。在这里,655350 是设置的具体数值。这个数值设置的相当高,意味着服务器配置了非常高的并发处理能力。在 Unix-like 系统中,文件描述符用于访问所有类型的文件,包括网络套接字。因此,增加这个限制可以让服务器处理更多的并发请求,特别是对于需要处理大量静态文件或者提供大量 Web 服务的场景。设置这个值通常需要服务器管理员有适当的权限,并且可能需要在系统级别进行相应的调整,因为操作系统也有自己的限制。在实际应用中,服务器管理员需要根据服务器的硬件资源、预期的负载以及实际的应用场景来合理设置这个值,以确保服务器既能充分利用资源,又不会因为超过系统限制而导致性能问题。"
			}
		],
		"stream": false,
		"max_tokens": 0
	}]]
wrk.headers["Content-Type"] = "application/json"

The model is deployed with:
CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server ./Yi-1.5-9B-Chat --server-port 8081 --model-name yi --cache-max-entry-count 0.9 --tp 1 --session-len 4096 --enable-prefix-caching

medwang1 commented:

#1619

I suspect these are all the same issue: when the content is very long and the concurrency is high, this problem is triggered.

lvhan028 (Collaborator) commented:

@lzhangzz could you please investigate this issue?
