RAG not able to read .json files #9049

LorenzoCutrupi · 2023-11-21T09:59:37Z

I have written this very basic rag:

documents = SimpleDirectoryReader("/content/Data/").load_data()
llm = LlamaCPP(
    model_url='https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf',
    ...,
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
embed_model = HuggingFaceEmbeddings(
    model_name="thenlper/gte-large"
)
service_context = ServiceContext.from_defaults(
    chunk_size=256,
    llm=llm,
    embed_model=embed_model
)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()
response = query_engine.query("What's the quantity of Xbox 720 available?")
print(response)

I'm using llama_index for everything except the embed_model, because Colab had issues running llama_index.HuggingFaceEmbedding. Now, my problem: the files in the Data folder are mostly .pdf, but there are some .json files too. Why is the RAG only able to answer questions about the PDFs? I tested with multiple prompts and it seems the JSON files are not read at all, even though I'm sure documents contains those files too, so I don't know where the problem is or how I can solve it.

@LorenzoCutrupi · 2023-11-21T10:10:22Z

dosubot[bot]
bot Nov 21, 2023

🤖

Hello @LorenzoCutrupi,

Nice to meet you! I'm an AI developed to help with your technical questions and bug fixes, and to guide you in becoming a more active contributor. While we're waiting for a human maintainer, I'm here to help you. Let's dig into your issue together!

The issue you're experiencing is due to the fact that the SimpleDirectoryReader in LlamaIndex is not configured to read .json files by default. The SimpleDirectoryReader uses a dictionary called DEFAULT_FILE_READER_CLS to map file extensions to their respective reader classes, and .json is not included in this mapping.

To enable reading .json files, you would need to add a JSONReader to the DEFAULT_FILE_READER_CLS dictionary. However, this would require you to modify the LlamaIndex source code, which may not be ideal.
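To make the extension-to-reader mapping concrete, here is a minimal standard-library sketch of that kind of dispatch. It is not LlamaIndex's code: `read_text`, `read_json`, `DEFAULT_LOADERS`, and `load_directory` are hypothetical names, the sample files are made up, and skipping unmapped files is a simplification (the real reader may treat them differently, for example by falling back to plain text).

```python
import json
import tempfile
from pathlib import Path


def read_text(path: Path) -> str:
    """Hypothetical loader: return the file contents unchanged."""
    return path.read_text(encoding="utf-8")


def read_json(path: Path) -> str:
    """Hypothetical loader: parse the JSON and re-serialize it in a stable layout."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return json.dumps(data, indent=2, sort_keys=True)


# Extension -> loader mapping, similar in spirit to DEFAULT_FILE_READER_CLS.
# Note that ".json" is deliberately absent from the defaults.
DEFAULT_LOADERS = {".txt": read_text}


def load_directory(directory, extra_loaders=None):
    """Load every file that has a registered loader; report the rest as skipped."""
    loaders = {**DEFAULT_LOADERS, **(extra_loaders or {})}
    loaded, skipped = {}, []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        loader = loaders.get(path.suffix.lower())
        if loader is None:
            skipped.append(path.name)
        else:
            loaded[path.name] = loader(path)
    return loaded, skipped


# Demo on throwaway files (made-up sample data).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "notes.txt").write_text("hello", encoding="utf-8")
    Path(tmp, "stock.json").write_text('{"item": "console", "quantity": 3}', encoding="utf-8")
    default_loaded, default_skipped = load_directory(tmp)
    full_loaded, full_skipped = load_directory(tmp, {".json": read_json})

print(sorted(default_loaded), default_skipped)
print(sorted(full_loaded), full_skipped)
```

Registering a loader at call time, as `extra_loaders` does here, avoids editing the defaults. As far as I know, SimpleDirectoryReader accepts a `file_extractor` argument that plays a similar role (for example `file_extractor={".json": JSONReader()}`), so modifying the library source may not be necessary; please check the documentation for your installed version.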

Alternatively, you can use the JSONReader class provided by LlamaIndex to read your .json files separately and then combine the data with the data read by the SimpleDirectoryReader. Here is an example of how to use the JSONReader class:

from llama_index.readers.json import JSONReader

# Create an instance of JSONReader
reader = JSONReader()

# Use the instance to load data from a .json file
data = reader.load_data("<path_to_your_json_file>")

You can find more information about the JSONReader class in the test_json.py file in the LlamaIndex repository.

Please note that you would need to handle the merging of the data read by the SimpleDirectoryReader and the JSONReader in a way that is appropriate for your specific use case.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

LorenzoCutrupi Nov 21, 2023
Author

Taking inspiration from this reply (which is basically the answer), I rewrote the code for extracting documents in this way:

import os

from llama_index.readers.json import JSONReader

directory_path = "/content/Data/"
documents = SimpleDirectoryReader(directory_path).load_data()

json_docs = []
for filename in os.listdir(directory_path):
    filepath = os.path.join(directory_path, filename)

    # Check if the file is a JSON file
    if filename.endswith(".json") and os.path.isfile(filepath):
        # Load data using JSONReader for JSON files
        json_data = JSONReader().load_data(filepath)
        json_docs.extend(json_data)

# Combine the documents from both readers
merged_docs = documents + json_docs

So basically the first line of the code in the question is now replaced by these lines.
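For anyone reusing this, the loop can be wrapped in a small helper. This is only a sketch under a few assumptions: `collect_json_paths`, `load_json_documents`, and `EchoReader` are hypothetical names, and `EchoReader` is a stand-in with the same `load_data(path)` shape as JSONReader so the example runs without llama_index installed. Compared with the loop above, it matches the extension case-insensitively and processes the files in sorted order.

```python
import tempfile
from pathlib import Path


def collect_json_paths(directory):
    """Return the .json files directly inside `directory`, sorted by name."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".json"
    )


def load_json_documents(directory, reader):
    """Call `reader.load_data(path)` for each JSON file and flatten the results."""
    docs = []
    for path in collect_json_paths(directory):
        docs.extend(reader.load_data(str(path)))
    return docs


class EchoReader:
    """Hypothetical stand-in for JSONReader; returns one placeholder per file."""

    def load_data(self, path):
        return [f"doc from {Path(path).name}"]


# Demo on throwaway files: two JSON files, one PDF, and a directory named like a JSON file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.json", "a.JSON", "report.pdf"):
        Path(tmp, name).write_text("{}", encoding="utf-8")
    Path(tmp, "nested.json").mkdir()
    json_docs = load_json_documents(tmp, EchoReader())

print(json_docs)
```

With the real reader this would presumably be called as `load_json_documents(directory_path, JSONReader())`, followed by `merged_docs = documents + json_docs` as in the reply above. If the directory reader in your version already picks up the .json files in some form, the merged list may contain them twice, so it is worth checking before building the index.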

Answer selected by LorenzoCutrupi
