[Question]: Slowness of LLM inference step using chat engine #8429
Replies: 9 comments 4 replies
-
🤖 Hello,

Thank you for your detailed question. It seems like you've done a thorough job of ensuring that the settings for the LLM are the same in both the webui project and the LlamaIndex project.

The difference in speed between the two projects could be due to a number of factors. One possibility is that the implementations of the 'generate' function in the two projects are different. In the LlamaIndex project, the 'generate' function is part of the LLMQuestionGenerator class and generates sub-questions based on the provided tools and query. In the webui project, on the other hand, the 'generate' function is part of the OpenAIQuestionGenerator class and generates a list of sub-questions based on the provided query and tools.

Another possibility is that the GPU utilization is different in the two projects. You mentioned that the GPU utilization during the generate function in the webui project is much higher than in the LlamaIndex project. This could be due to differences in how the two projects are set up to use the GPU; for example, the LlamaIndex project might be using a different version of PyTorch or a different CUDA toolkit version, which could affect GPU utilization.

To investigate this further, you could try profiling the 'generate' function in both projects to see where the time is being spent. This could help identify any bottlenecks or inefficiencies in the code. You could also try running the projects on a different GPU to see if the issue is specific to the T4 GPUs.

I hope this helps. If you have any further questions, please don't hesitate to ask.

Best,
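For illustration, a minimal sketch of the kind of profiling check suggested above, using Python's built-in cProfile; `generate_fn` is a placeholder for whichever project's generate call is being measured, not code from either repository:

```python
import cProfile
import io
import pstats


def profile_generate(generate_fn, *args, **kwargs):
    """Profile a single call to a generate-style function and print the hottest calls."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = generate_fn(*args, **kwargs)  # the call under test
    profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
    stats.print_stats(20)  # top 20 entries by cumulative time
    print(stream.getvalue())
    return result
```

cProfile only shows where Python-side time goes; for a kernel-level view of the GPU work, `torch.profiler.profile()` is the usual next step.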
-
@dosu-bot just to be super clear, I meant the generate function here and in llama2-ui.
Both were run using the same hardware setup and conda environment, i.e. the same package versions.
-
@Swarnashree I think it's likely slow because we are packing the context window to near its limit. If you just prompt the LLM with a short message, inference is fast. But if you fill the context window with something like 3900 tokens, inference will be slower.
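As a quick way to see this effect in isolation (outside either project), one can time raw generation throughput at different prompt lengths with plain `transformers`; this is a sketch under the assumption of an fp16 setup sharded with `device_map="auto"`, and the checkpoint name and filler prompt are placeholders:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute whichever Mistral weights are under test.
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)


def tokens_per_second(prompt: str, max_new_tokens: int = 128) -> float:
    """Generate max_new_tokens for this prompt and report decode throughput."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    elapsed = time.perf_counter() - start
    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    return new_tokens / elapsed


# Compare a short prompt against one that nearly fills the context window.
print("short prompt:", tokens_per_second("Hello, how are you?"))
print("long prompt: ", tokens_per_second("some filler text " * 1000))
```

If throughput drops sharply for the long prompt on both stacks, the gap is mostly explained by context size rather than by anything llama-index-specific.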
-
Yes, I agree. Just to mention, I copied the final RAG-based prompt generated by the llama-index project into the llama2-ui project, so both have the same incoming prompt to the LLM, which was around 2.5k tokens in both cases.
-
If you spot something that could make it faster in our source code, I encourage a PR!
-
Interesting. If you have time to debug this, it would be appreciated; off-hand I'm not sure what the difference is.
-
Hi @logan-markewich, after spending some time debugging, I see that the bottleneck comes from starting the background thread for `write_response_to_history`. To dig further, I tried all combinations of using/not using background (bg) threads: with func1 on a bg thread and func2 not on a bg thread, the speed of inference is as expected (👍). This probably means the GPU performance is being degraded by the extra multi-threading. What's the best way to resolve this?
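One way to sanity-check the "background thread degrades generation" hypothesis in isolation is to time the same call with and without an artificial competing thread; this is a sketch, and `generate_fn` is a placeholder for the actual LLM call rather than code from either project:

```python
import time
from threading import Event, Thread


def busy_consumer(stop: Event) -> None:
    """Stand-in for a background thread that polls a stream in a tight loop."""
    while not stop.is_set():
        time.sleep(0)  # yield constantly, contending for the GIL like a tight polling loop


def time_generate(generate_fn, with_background_thread: bool) -> float:
    """Time one generate() call, optionally with a competing background thread running."""
    stop = Event()
    worker = None
    if with_background_thread:
        worker = Thread(target=busy_consumer, args=(stop,), daemon=True)
        worker.start()
    start = time.perf_counter()
    generate_fn()
    elapsed = time.perf_counter() - start
    stop.set()
    if worker is not None:
        worker.join()
    return elapsed
```

If the call is noticeably slower with the competing thread, the slowdown is coming from Python-level thread contention around the GPU call rather than from the prompt contents.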
-
@Swarnashree amazing debugging! I'm not 100% sure on how best to resolve this 🤔. The thread is needed in order for streaming and the chat history to work. I'll look into this, chat with the team, and see if we can figure out a fix.
-
Alright, I took a look. I think what's happening here is that we may be creating a race condition on the data for the chat stream:
chat_response = StreamingAgentChatResponse(
chat_stream=self._llm.stream_chat(all_messages), # this itself spawns a thread; race condition on this data
sources=[
ToolOutput(
tool_name="retriever",
content=str(prefix_messages[0]),
raw_input={"message": message},
raw_output=prefix_messages[0],
)
],
source_nodes=nodes,
)
thread = Thread(
target=chat_response.write_response_to_history, args=(self._memory,)
)

We've got one thread for `write_response_to_history` (and `stream_chat` itself spawns another), both working over the same streaming data.

The analysis conducted by @Swarnashree corroborates this: when only one of these calls is put on a Thread, there is no race condition, and thus no slowdown.

If this is in fact what's going on here, then we could use a "lock" pattern to make this more thread safe. But also, I am wondering if we need the extra thread at all.

@Swarnashree: how did you take one of these calls off its background thread in your tests?

(CC: @Disiok)
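To make the "lock" idea concrete, here is a minimal sketch (not the actual LlamaIndex implementation) of guarding a shared token buffer with `threading.Lock` so the streaming producer and the history writer never touch the same data concurrently:

```python
from threading import Lock


class SharedStreamBuffer:
    """Token buffer shared between a streaming producer thread and a history-writing thread."""

    def __init__(self) -> None:
        self._tokens = []
        self._lock = Lock()

    def append(self, token: str) -> None:
        # Only one thread mutates the buffer at a time.
        with self._lock:
            self._tokens.append(token)

    def snapshot(self) -> str:
        # Readers get a consistent view even while the producer is still appending.
        with self._lock:
            return "".join(self._tokens)
```

Whether a lock actually helps here depends on whether the slowdown is a genuine data race or just contention between the two threads; a lock addresses the former, not the latter.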
-
Question
Hi, I am trying to set up a basic RAG chat pipeline with the code below, using the Mistral 7B Instruct model. I am noticing that the generate function under the hood is a lot slower than the generate function from the liltom-eth/llama2-webui setup for the same Mistral model. (All other steps, such as retrieving the nodes used in the prompt, are fast enough; only the step of generating content from the LLM is slow.)
I made sure to use the same settings for the LLM in both the webui project and the llama-index project by setting them explicitly in the llama-index project.
The llama-index RAG pipeline is as follows:
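A rough sketch of a pipeline of this shape, assuming a 0.8.x-era `llama_index` layout with `HuggingFaceLLM` serving Mistral 7B Instruct; the model name, data path, and generation parameters below are illustrative placeholders rather than the original configuration:

```python
# Assumed imports for a 0.8.x-era llama_index install; paths differ in newer releases.
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms import HuggingFaceLLM

# Placeholder model/settings; not the original configuration from this question.
llm = HuggingFaceLLM(
    model_name="mistralai/Mistral-7B-Instruct-v0.1",
    tokenizer_name="mistralai/Mistral-7B-Instruct-v0.1",
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"do_sample": True, "temperature": 0.7},
    device_map="auto",  # shard across the available GPUs
)

service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")
documents = SimpleDirectoryReader("./data").load_data()  # placeholder data directory
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

chat_engine = index.as_chat_engine(chat_mode="context")
response = chat_engine.stream_chat("What does the document say about X?")
for token in response.response_gen:
    print(token, end="", flush=True)
```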
Any help on where the speed difference is coming from would be appreciated.
I am running both on 4 T4 GPUs and also noticed that GPU utilization during the generate function is a much higher percentage in the webui project than in the llama-index project, which makes me wonder whether there is some inefficiency in GPU utilization in the way I am setting up the RAG pipeline.