[Question]: Slowness of LLM inference step using chat engine #8429
Replies: 9 comments 4 replies
-
🤖 Hello,

Thank you for your detailed question. It seems like you've done a thorough job of ensuring that the settings for the LLM are the same in both the webui project and the LlamaIndex project.

The difference in speed between the two projects could be due to a number of factors. One possibility is that the implementations of the 'generate' function in the two projects are different. In the LlamaIndex project, the 'generate' function is part of the LLMQuestionGenerator class and generates sub-questions based on the provided tools and query. In the webui project, on the other hand, the 'generate' function is part of the OpenAIQuestionGenerator class and generates a list of sub-questions based on the provided query and tools.

Another possibility is that the GPU utilization is different in the two projects. You mentioned that the GPU utilization during the generate function in the webui project is much higher than in the LlamaIndex project. This could be due to differences in how the two projects are set up to use the GPU; for example, the LlamaIndex project might be using a different version of PyTorch or a different CUDA toolkit version, which could affect GPU utilization.

To investigate this further, you could try profiling the 'generate' function in both projects to see where the time is being spent. This could help identify any bottlenecks or inefficiencies in the code. You could also try running the projects on a different GPU to see if the issue is specific to the T4 GPUs.

I hope this helps. If you have any further questions, please don't hesitate to ask.

Best,
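For illustration, a minimal sketch of the kind of profiling check suggested above, using Python's built-in cProfile; `generate_fn` is a placeholder for whichever project's generate call is being measured, not code from either repository:

```python
import cProfile
import io
import pstats


def profile_generate(generate_fn, *args, **kwargs):
    """Profile a single call to a generate-style function and print the hottest calls."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = generate_fn(*args, **kwargs)  # the call under test
    profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
    stats.print_stats(20)  # top 20 entries by cumulative time
    print(stream.getvalue())
    return result
```

cProfile only shows where Python-side time goes; for a kernel-level view of the GPU work, `torch.profiler.profile()` is the usual next step.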
-
@dosu-bot just to be super clear, I meant the generate function here and in llama2-ui.
Both were run using the same hardware setup and conda environment, i.e. the same package versions.
-
@Swarnashree I think it's likely slow because we are packing the context window to near its limit. If you just prompt the LLM with a short message, inference is fast. But if you fill the context window with something like 3900 tokens, inference will be slower.
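As a quick way to see this effect in isolation (outside either project), one can time raw generation throughput at different prompt lengths with plain `transformers`; this is a sketch under the assumption of an fp16 setup sharded with `device_map="auto"`, and the checkpoint name and filler prompt are placeholders:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute whichever Mistral weights are under test.
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)


def tokens_per_second(prompt: str, max_new_tokens: int = 128) -> float:
    """Generate max_new_tokens for this prompt and report decode throughput."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    elapsed = time.perf_counter() - start
    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    return new_tokens / elapsed


# Compare a short prompt against one that nearly fills the context window.
print("short prompt:", tokens_per_second("Hello, how are you?"))
print("long prompt: ", tokens_per_second("some filler text " * 1000))
```

If throughput drops sharply for the long prompt on both stacks, the gap is mostly explained by context size rather than by anything llama-index-specific.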
-
Yes, I agree. Just to mention, I copied the final RAG-based prompt generated by the llama-index project into the llama2-ui project, so both have the same incoming prompt to the LLM, which was around 2.5k tokens in both cases.
-
If you spot something that could make it faster in our source code, I encourage a PR!
-
Interesting. If you have time to debug this, it would be appreciated; off-hand I'm not sure what the difference is.
-
Hi @logan-markewich, after spending some time debugging, I see that the bottleneck comes from starting the background thread for `write_response_to_history`. To dig further, I tried all combinations of using/not using background (bg) threads: with func1 on a bg thread and func2 not on a bg thread, the speed of inference is as expected (👍). This probably means the GPU performance is being degraded by the extra multi-threading. What's the best way to resolve this?
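One way to sanity-check the "background thread degrades generation" hypothesis in isolation is to time the same call with and without an artificial competing thread; this is a sketch, and `generate_fn` is a placeholder for the actual LLM call rather than code from either project:

```python
import time
from threading import Event, Thread


def busy_consumer(stop: Event) -> None:
    """Stand-in for a background thread that polls a stream in a tight loop."""
    while not stop.is_set():
        time.sleep(0)  # yield constantly, contending for the GIL like a tight polling loop


def time_generate(generate_fn, with_background_thread: bool) -> float:
    """Time one generate() call, optionally with a competing background thread running."""
    stop = Event()
    worker = None
    if with_background_thread:
        worker = Thread(target=busy_consumer, args=(stop,), daemon=True)
        worker.start()
    start = time.perf_counter()
    generate_fn()
    elapsed = time.perf_counter() - start
    stop.set()
    if worker is not None:
        worker.join()
    return elapsed
```

If the call is noticeably slower with the competing thread, the slowdown is coming from Python-level thread contention around the GPU call rather than from the prompt contents.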
-
@Swarnashree amazing debugging! I'm not 100% sure on how best to resolve this 🤔. The thread is needed in order for streaming and the chat history to work. I'll look into this, chat with the team, and see if we can figure out a fix.
-
Alright, I took a look. I think what's happening here is that we may be creating a race condition on the data for the chat stream:
chat_response = StreamingAgentChatResponse(
chat_stream=self._llm.stream_chat(all_messages), # this itself spawns a thread; race condition on this data
sources=[
ToolOutput(
tool_name="retriever",
content=str(prefix_messages[0]),
raw_input={"message": message},
raw_output=prefix_messages[0],
)
],
source_nodes=nodes,
)
thread = Thread(
target=chat_response.write_response_to_history, args=(self._memory,)
)

We've got one thread for `write_response_to_history` (and `stream_chat` itself spawns another), both working over the same streaming data.

The analysis conducted by @Swarnashree corroborates this: when only one of these calls is put on a Thread, there is no race condition, and thus no slowdown.

If this is in fact what's going on here, then we could use a "lock" pattern to make this more thread safe. But also, I am wondering if we need the extra thread at all.

@Swarnashree: how did you take one of these calls off its background thread in your tests?

(CC: @Disiok)
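To make the "lock" idea concrete, here is a minimal sketch (not the actual LlamaIndex implementation) of guarding a shared token buffer with `threading.Lock` so the streaming producer and the history writer never touch the same data concurrently:

```python
from threading import Lock


class SharedStreamBuffer:
    """Token buffer shared between a streaming producer thread and a history-writing thread."""

    def __init__(self) -> None:
        self._tokens = []
        self._lock = Lock()

    def append(self, token: str) -> None:
        # Only one thread mutates the buffer at a time.
        with self._lock:
            self._tokens.append(token)

    def snapshot(self) -> str:
        # Readers get a consistent view even while the producer is still appending.
        with self._lock:
            return "".join(self._tokens)
```

Whether a lock actually helps here depends on whether the slowdown is a genuine data race or just contention between the two threads; a lock addresses the former, not the latter.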
-
Question
Hi, I am trying to set up a basic RAG chat pipeline with the code below, using the Mistral 7B Instruct model. I am noticing that the generate function under the hood is a lot slower than the generate function from the liltom-eth/llama2-webui setup for the same Mistral model. (All other steps, such as retrieving the nodes used in the prompt, are fast enough; only the step of generating content from the LLM is slow.)
I made sure to use the same settings for the LLM in both the webui project and the llama-index project by setting them explicitly in the llama-index project.
The llama-index RAG pipeline is as follows:
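A rough sketch of a pipeline of this shape, assuming a 0.8.x-era `llama_index` layout with `HuggingFaceLLM` serving Mistral 7B Instruct; the model name, data path, and generation parameters below are illustrative placeholders rather than the original configuration:

```python
# Assumed imports for a 0.8.x-era llama_index install; paths differ in newer releases.
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms import HuggingFaceLLM

# Placeholder model/settings; not the original configuration from this question.
llm = HuggingFaceLLM(
    model_name="mistralai/Mistral-7B-Instruct-v0.1",
    tokenizer_name="mistralai/Mistral-7B-Instruct-v0.1",
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"do_sample": True, "temperature": 0.7},
    device_map="auto",  # shard across the available GPUs
)

service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")
documents = SimpleDirectoryReader("./data").load_data()  # placeholder data directory
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

chat_engine = index.as_chat_engine(chat_mode="context")
response = chat_engine.stream_chat("What does the document say about X?")
for token in response.response_gen:
    print(token, end="", flush=True)
```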
Any help on where the speed difference is coming from would be appreciated.
I am running both on 4 T4 GPUs and also noticed that GPU utilization during the generate function is a much higher percentage in the webui project than in the llama-index project, which makes me wonder whether there is some inefficiency in GPU utilization in the way I am setting up the RAG pipeline.