Replies: 1 comment
🤖 Hello,

Thank you for reaching out and providing detailed information about your issue. The `TokenTextSplitter` splits text into chunks of at most `chunk_size` tokens. In your case, you've set `chunk_size=5` with `chunk_overlap=1`, so each node should contain roughly five tokens. If your long text file is not being split into multiple nodes as expected, it could be due to the size of your text or the settings of your `SimpleNodeParser`.

Also, please note that if the metadata length is longer than the chunk size, an error will be raised, and if the resulting chunks are less than 50 tokens, a warning will be printed. Consider increasing the chunk size or decreasing the size of your metadata to avoid this.

Here is a reference to a similar issue that was solved: [Question]: How to split or separate nodes Document data?. The solution was to implement a function to split large documents into smaller chunks before passing the documents to the node parser.

I hope this helps! If you have any other questions or need further clarification, please let me know.

Best,

Sources
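To illustrate the pre-splitting idea from the linked issue, here is a minimal plain-Python sketch (not the LlamaIndex API; the `split_text` helper and its parameters are hypothetical) that breaks a long text into overlapping word-based chunks, each of which could then become its own `Document` before indexing:

```python
def split_text(text, chunk_size=5, chunk_overlap=1, separator=" "):
    """Split `text` into chunks of at most `chunk_size` words,
    with consecutive chunks sharing `chunk_overlap` words."""
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must be larger than chunk_overlap")
    words = text.split(separator)
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(separator.join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already covered the tail of the text
    return chunks

chunks = split_text("one two three four five six seven eight nine")
# each chunk would then be wrapped in its own Document before indexing
```

This mirrors what `TokenTextSplitter(chunk_size=5, chunk_overlap=1)` is meant to do token-wise, but done up front so you can verify the chunking yourself before handing the documents to the parser.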
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
Question Validation
Question
Hello,
I have dug into your docs, but I have an issue related to chunking and text splitting. As I understand it, the text splitter should split one document into many nodes, which are then stored in the docstore. But when I test with a file containing long text, that doesn't happen.

Here is my code where I try to set up the text splitter:
```python
text_splitter = TokenTextSplitter(
    separator=" ",
    chunk_size=5,
    chunk_overlap=1,
    backup_separators=["\n"],
)
parser = SimpleNodeParser(
    text_splitter=text_splitter,
)
index = VectorStoreIndex(
    documents,
    service_context=service_context,
    show_progress=True,
)
```
Please help me.