Replies: 4 comments 4 replies
-
🤖 Hello, The difference in speed between the two setups can come from several factors. The batch size used when embedding and whether the model runs on CPU or GPU both matter, and the complexity of the model used for embeddings can also affect the speed: more complex models might generate better embeddings but take longer to process. To increase the speed of the LlamaIndex HuggingFaceEmbedding, you can try increasing the embedding batch size, running the model on a GPU, or switching to an ONNX-optimized model via OptimumEmbedding.
Here is an example of how you can modify the HuggingFaceEmbedding:

from llama_index.embeddings import HuggingFaceEmbedding

embedding = HuggingFaceEmbedding(
    model_name='bert-base-uncased',
    tokenizer_name='bert-base-uncased',
    embed_batch_size=512,  # Increase batch size
    device='cuda'  # Use GPU
)

And here is how you can create and use an OptimumEmbedding:

from llama_index.embeddings import OptimumEmbedding
OptimumEmbedding.create_and_save_optimum_model(
"BAAI/bge-small-en-v1.5", "./bge_onnx"
)
embed_model = OptimumEmbedding(folder_name="./bge_onnx")

Please note that the actual values for these parameters depend on your hardware and use case, so you may need to adjust them. You can find more information in the llama_index/embeddings/huggingface.py and docs/examples/embeddings/huggingface.ipynb files. I hope this helps! Let me know if you have any other questions.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
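As a rough follow-up sketch to the OptimumEmbedding example above: once the ONNX model has been saved and loaded, embeddings are generated through the same interface as any other LlamaIndex embedding model (this assumes the standard BaseEmbedding methods get_text_embedding / get_text_embedding_batch; method names may differ in newer releases).

texts = ["first document", "second document"]

# Embed a single string
vector = embed_model.get_text_embedding(texts[0])

# Embed several strings in one call, which is usually faster than looping
vectors = embed_model.get_text_embedding_batch(texts)

print(len(vector), len(vectors))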
-
Could you try the
-
If I understand correctly, based on the relevant Python file, the HuggingFaceEmbedding module in LlamaIndex communicates with the HuggingFace API over the internet. Right? If yes, then maybe the slow performance is caused by the API calls...
-
Hello, I use the OptimumEmbedding model for embedding. However, it seems that the ONNX model is running on the CPU instead of the GPU. Is there anything I need to set?
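One quick thing to check here is whether ONNX Runtime can see the GPU at all. This is a minimal sketch using the standard onnxruntime API; how a GPU provider is passed to OptimumEmbedding itself may vary by llama_index version, so this only checks the runtime side.

import onnxruntime

# If only 'CPUExecutionProvider' is listed, the CPU-only onnxruntime package
# is installed; GPU inference typically requires onnxruntime-gpu so that
# 'CUDAExecutionProvider' appears in this list.
print(onnxruntime.get_available_providers())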
-
Question
Hello,
I have a question regarding the embeddings generation process in LlamaIndex. If I use

from llama_index.embeddings import HuggingFaceEmbedding

instead of the LangChain integration (from llama_index import VectorStoreIndex, SimpleDirectoryReader, LangchainEmbedding), the time needed to generate the embeddings is longer. I timed the process and found that the LangChain integration generates the embeddings 4 to 7 times faster with the same model (paraphrase-MiniLM-L6-v2). Do you know why this happens? How can I increase the speed of the LlamaIndex HuggingFaceEmbedding? I ask because I want to use as few libraries as possible, so I would prefer to keep only LlamaIndex rather than both frameworks in my project's requirements.
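For reference, a rough sketch of the kind of timing comparison described above. The model name comes from the question; get_text_embedding_batch and the LangchainEmbedding wrapper are assumed to follow the llama_index / langchain APIs in use at the time of this thread and may differ in newer releases.

import time

from llama_index import LangchainEmbedding
from llama_index.embeddings import HuggingFaceEmbedding
from langchain.embeddings import HuggingFaceEmbeddings

texts = ["an example sentence to embed"] * 1000
model_name = "sentence-transformers/paraphrase-MiniLM-L6-v2"

# Native LlamaIndex embedding
hf_model = HuggingFaceEmbedding(model_name=model_name)
start = time.time()
hf_model.get_text_embedding_batch(texts)
print("HuggingFaceEmbedding:", time.time() - start)

# LangChain embedding wrapped for use inside LlamaIndex
lc_model = LangchainEmbedding(HuggingFaceEmbeddings(model_name=model_name))
start = time.time()
lc_model.get_text_embedding_batch(texts)
print("LangchainEmbedding:", time.time() - start)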