Why do chunking based on cl100k_base tokens instead of character length? #9454
Replies: 1 comment
🤖 Hi @mroedder-d7! Good to see you around here again. Always appreciating your insightful questions. Hope you're doing fantastic!

The choice to use […] The chunking process in LlamaIndex is handled in the […] As for the comparison between […] If you find the default chunking method based on […]

I hope this helps! If you have further questions or need more clarification, feel free to ask.

Dosu

Sources: […]

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
Why does Llama-Index do chunking based on `cl100k_base` tokens (per default) instead of character length?

To me this is quite unintuitive, as:

- […] `cl100k_base` as the tokenizer, but (e.g.) `ada-002` […]. So even there it is not a helpful forecast for the expected space in the future context length.

So at best this logic does not gain anything and just uses CPU time, and at the worst this is confusing all users that rely on the defaults.

I would like to try to understand where the motivation to use `cl100k_base` came from and if it still makes sense to keep this default. 🙂
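For reference, the character-based alternative the question argues for needs no tokenizer at all. Below is a minimal sketch of such a splitter; `chunk_text` and its parameters are hypothetical illustrations, not LlamaIndex's actual API:

```python
def chunk_text(text: str, chunk_size: int = 1024, overlap: int = 64) -> list[str]:
    """Split text into fixed character-length chunks with a fixed overlap.

    Hypothetical helper for illustration only; LlamaIndex's default splitter
    instead counts cl100k_base tokens via tiktoken.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap  # each chunk starts `overlap` chars before the previous one ends
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 2500
parts = chunk_text(doc, chunk_size=1024, overlap=64)
print(len(parts))     # → 3
print(len(parts[0]))  # → 1024
```

This is cheap and deterministic, which is the question's point; the trade-off is that a character budget maps only loosely onto a model's token-based context window, since the characters-per-token ratio varies with the content being split.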