[V1] get_computed_blocks avoids recompute #10695
base: main
Conversation
Signed-off-by: Abatom <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
I don't think this is correct. In this way you only share the blocks within a request.
@comaniac Thank you, I will think about it again.
@comaniac I have drawn a diagram to illustrate my modifications. I believe that the cache is not fully shared within a request.
Signed-off-by: Abatom <[email protected]>
What I'm saying is that your change makes blocks unable to be shared "across" requests.
@comaniac The green blocks have already been computed, so there's no need to compute them again; this is where the optimization comes in.
How do you hit cached blocks for a new request, which hasn't been added to req_to_blocks[request.request_id]?
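
For context, here is a minimal, hypothetical sketch of hash-based prefix caching in the style of vLLM V1 (the names `hash_block_tokens`, `cached_block_hash_to_block`, and `get_computed_block_ids` are illustrative, not the actual vLLM API). Each block's hash chains the parent block's hash with the block's token IDs, so two different requests that share a prefix produce identical hash chains and hit the same cached blocks; this cross-request sharing is what the comment above is about.

```python
import hashlib
from typing import Optional

BLOCK_SIZE = 4  # tokens per KV-cache block (illustrative; vLLM's is configurable)

def hash_block_tokens(parent_hash: Optional[int], token_ids: tuple) -> int:
    # Chain the parent block's hash with this block's token IDs, so any two
    # requests that share a prefix produce the same hash chain.
    digest = hashlib.sha256(repr((parent_hash, token_ids)).encode()).hexdigest()
    return int(digest[:16], 16)

# Shared across ALL requests: block hash -> id of the cached physical block.
cached_block_hash_to_block: dict[int, int] = {}

def get_computed_block_ids(token_ids: list[int]) -> list[int]:
    """Walk the prompt block by block; stop at the first cache miss."""
    computed: list[int] = []
    parent_hash = None
    for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        block = tuple(token_ids[start:start + BLOCK_SIZE])
        parent_hash = hash_block_tokens(parent_hash, block)
        block_id = cached_block_hash_to_block.get(parent_hash)
        if block_id is None:
            break
        computed.append(block_id)
    return computed

# Cross-request sharing: a brand-new request hits blocks another request cached.
cached_block_hash_to_block[hash_block_tokens(None, (1, 2, 3, 4))] = 0
assert get_computed_block_ids([1, 2, 3, 4, 9, 9, 9, 9]) == [0]
```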
OK, I see what you meant. Yes, this saves hash computation, but I'm afraid it makes the code more complicated. Do you have benchmark numbers showing that the overhead is non-negligible, so that we have to introduce this optimization?
Alright, I'll think about how to conduct benchmark testing.
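
One way to do that would be a microbenchmark that isolates the per-prompt hash computation. This is only a hedged sketch (`bench_hash_overhead` is a made-up helper reusing the illustrative hash above, not vLLM's benchmark suite); its numbers would bound the cost this PR can save per scheduling step.

```python
import hashlib
import random
import time
from typing import Optional

def hash_block_tokens(parent_hash: Optional[int], token_ids: tuple) -> int:
    # Same illustrative chained hash as in the sketch above.
    digest = hashlib.sha256(repr((parent_hash, token_ids)).encode()).hexdigest()
    return int(digest[:16], 16)

def bench_hash_overhead(num_tokens: int = 32_768, block_size: int = 16,
                        iters: int = 100) -> float:
    """Return the mean time in ms to hash every block of one synthetic prompt."""
    token_ids = [random.randrange(32_000) for _ in range(num_tokens)]
    start = time.perf_counter()
    for _ in range(iters):
        parent_hash = None
        for s in range(0, num_tokens - block_size + 1, block_size):
            parent_hash = hash_block_tokens(
                parent_hash, tuple(token_ids[s:s + block_size]))
    return (time.perf_counter() - start) / iters * 1e3

if __name__ == "__main__":
    print(f"per-request hashing overhead: {bench_hash_overhead():.3f} ms")
```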
Use self.req_to_blocks[request.request_id] instead of recomputing.
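
As a sketch of the change the PR title describes (`KVCacheManagerSketch` and `_lookup_by_block_hashes` are hypothetical; only `req_to_blocks` and `get_computed_blocks` appear in the discussion, and the real vLLM V1 implementation differs): a request already tracked in `req_to_blocks` returns its existing block list directly, while a new request, per the question raised above, still goes through the hash-based lookup so blocks remain shareable across requests.

```python
class KVCacheManagerSketch:
    """Illustrative skeleton only; the real vLLM V1 KVCacheManager differs."""

    def __init__(self) -> None:
        # request_id -> blocks already allocated to that request.
        self.req_to_blocks: dict[str, list[int]] = {}

    def get_computed_blocks(self, request) -> list[int]:
        # Proposed fast path: a request we are already tracking keeps its
        # block list, so its prefix hashes need not be recomputed.
        blocks = self.req_to_blocks.get(request.request_id)
        if blocks is not None:
            return blocks
        # New request: fall back to the hash-chain lookup (sketched above),
        # which is what allows blocks to be shared ACROSS requests.
        return self._lookup_by_block_hashes(request)

    def _lookup_by_block_hashes(self, request) -> list[int]:
        # Placeholder for the cross-request prefix-cache lookup.
        return []
```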