🤖 Based on the information you've provided, it seems like the issue you're encountering is related to how the data is being loaded and indexed. When the dataset is small, all the data can be loaded into memory and indexed without any issues. However, when the dataset is large, it may not be possible to load all the data into memory at once. This could result in only a portion of the data being indexed, which would explain why specific IDs cannot be located.

The `SimpleCSVReader`'s `load_data` method reads the file like this:

```python
with open(file, "r") as fp:
    csv_reader = csv.reader(fp)
    for row in csv_reader:
        text_list.append(", ".join(row))
```

As you can see, the reader appends every row to a single flat list, so the entire CSV can end up in one oversized document. To fix this issue, you might need to modify the `load_data` method so that it splits the rows into chunks:

```python
def load_data(self, file, chunksize=100):
    text_list = []
    with open(file, "r") as fp:
        csv_reader = csv.reader(fp)
        chunk = []
        for i, row in enumerate(csv_reader):
            chunk.append(", ".join(row))
            if (i + 1) % chunksize == 0:
                text_list.append(chunk)
                chunk = []
        if chunk:
            text_list.append(chunk)
    return text_list
```

In this modified version of the `load_data` method, rows are grouped into chunks of `chunksize` rows each, so a large file can be indexed chunk by chunk rather than as one block.

I hope this helps! Let me know if you have any other questions.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
Hi all, I am trying to load some data from a CSV and ask the LLM to summarise it. The CSV looks as follows:
```
ID,Property1,Property2,Property3,Outcome
ID1,1A,2B,3C,Passed
ID2,1B,2B,3C,Failed
...
```
I am loading and querying it with the following code:
```python
import os
os.environ["OPENAI_API_KEY"] = ...

from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)
from pathlib import Path
from llama_index import download_loader

SimpleCSVReader = download_loader("SimpleCSVReader")
loader = SimpleCSVReader(encoding="utf-8")

documents = loader.load_data(file=Path('./Data.csv'))
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist()

query_engine = index.as_query_engine()
response = query_engine.query("Can you summarise ID1?")
print(response)
```
When the dataset is small, it comes back with the correct answer that ID1 has the following properties: 1A, 2B, 3C, Passed (or something similar). However, when the dataset is too big (about 100 rows or so), it often tells me that it is unable to locate the ID in the data, even though the ID is clearly in one of the rows there. I have tried prompt tweaks, such as adding a text file stating which columns are which and that each record is on its own row, but the issue still occurs. Could anyone give me some guidance?
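As an aside on the snippet above: the `StorageContext` and `load_index_from_storage` imports are only needed when reloading a previously persisted index on a later run. A minimal sketch of that standard llama_index pattern, assuming the default `./storage` persist directory:

```python
from llama_index import StorageContext, load_index_from_storage

# Rebuild the storage context from the persisted files and reload the
# index, instead of re-embedding the whole CSV on every run.
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()
```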