Why do chunking based on cl100k_base tokens instead of character length? #9454
Replies: 1 comment
🤖 Hi @mroedder-d7! Good to see you around here again. Always appreciating your insightful questions. Hope you're doing fantastic!

The choice to use […] The chunking process in LlamaIndex is handled in the […] As for the comparison between […] If you find the default chunking method based on […]

I hope this helps! If you have further questions or need more clarification, feel free to ask.

Dosu

Sources: […]

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
Why does Llama-Index do chunking based on `cl100k_base` tokens (per default) instead of character length?

To me this is quite unintuitive, as:

- […] `cl100k_base` as the tokenizer, but (e.g.) `ada-002` […]. So even there it is not a helpful forecast for the expected space in the future context length.

So at best this logic does not gain anything and just uses CPU time, and at the worst this is confusing all users that rely on the defaults.

I would like to try to understand where the motivation to use `cl100k_base` came from and if it still makes sense to keep this default. 🙂
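For reference, the character-based alternative the question argues for needs no tokenizer at all. Below is a minimal sketch of such a splitter; `chunk_text` and its parameters are hypothetical illustrations, not LlamaIndex's actual API:

```python
def chunk_text(text: str, chunk_size: int = 1024, overlap: int = 64) -> list[str]:
    """Split text into fixed character-length chunks with a fixed overlap.

    Hypothetical helper for illustration only; LlamaIndex's default splitter
    instead counts cl100k_base tokens via tiktoken.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap  # each chunk starts `overlap` chars before the previous one ends
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 2500
parts = chunk_text(doc, chunk_size=1024, overlap=64)
print(len(parts))     # → 3
print(len(parts[0]))  # → 1024
```

This is cheap and deterministic, which is the question's point; the trade-off is that a character budget maps only loosely onto a model's token-based context window, since the characters-per-token ratio varies with the content being split.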