🤖 Based on the information you've provided, it seems like the issue you're encountering is related to how the data is being loaded and indexed. When the dataset is small, all the data can be loaded into memory and indexed without any issues. However, when the dataset is large, it may not be possible to load all the data into memory at once. This could result in only a portion of the data being indexed, which would explain why specific IDs cannot be located.

The `SimpleCSVReader`'s `load_data` method reads the file like this:

```python
with open(file, "r") as fp:
    csv_reader = csv.reader(fp)
    for row in csv_reader:
        text_list.append(", ".join(row))
```

As you can see, the reader appends every row to a single flat list, so the entire CSV can end up in one oversized document. To fix this issue, you might need to modify the `load_data` method so that it splits the rows into chunks:

```python
def load_data(self, file, chunksize=100):
    text_list = []
    with open(file, "r") as fp:
        csv_reader = csv.reader(fp)
        chunk = []
        for i, row in enumerate(csv_reader):
            chunk.append(", ".join(row))
            if (i + 1) % chunksize == 0:
                text_list.append(chunk)
                chunk = []
        if chunk:
            text_list.append(chunk)
    return text_list
```

In this modified version of the `load_data` method, rows are grouped into chunks of `chunksize` rows each, so a large file can be indexed chunk by chunk rather than as one block.

I hope this helps! Let me know if you have any other questions.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
Hi all, I am trying to load some data from a CSV and ask the LLM to summarise it. The CSV looks as follows:
```
ID,Property1,Property2,Property3,Outcome
ID1,1A,2B,3C,Passed
ID2,1B,2B,3C,Failed
...
```
I am loading and querying it with the following code:
```python
import os
os.environ["OPENAI_API_KEY"] = ...

from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)
from pathlib import Path
from llama_index import download_loader

SimpleCSVReader = download_loader("SimpleCSVReader")
loader = SimpleCSVReader(encoding="utf-8")

documents = loader.load_data(file=Path('./Data.csv'))
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist()

query_engine = index.as_query_engine()
response = query_engine.query("Can you summarise ID1?")
print(response)
```
When the dataset is small, it comes back with the correct answer that ID1 has the following properties: 1A, 2B, 3C, Passed (or something similar). However, when the dataset is too big (about 100 rows or so), it often tells me that it is unable to locate the ID in the data, even though the ID is clearly in one of the rows there. I have tried prompt tweaks, such as adding a text file stating which columns are which and that each record is on its own row, but the issue still occurs. Could anyone give me some guidance?
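As an aside on the snippet above: the `StorageContext` and `load_index_from_storage` imports are only needed when reloading a previously persisted index on a later run. A minimal sketch of that standard llama_index pattern, assuming the default `./storage` persist directory:

```python
from llama_index import StorageContext, load_index_from_storage

# Rebuild the storage context from the persisted files and reload the
# index, instead of re-embedding the whole CSV on every run.
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()
```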