[Question]: Knowledge Graph Vs RAG #8393
Replies: 7 comments 1 reply
-
🤖 Hello, Thank you for your interest in LlamaIndex and for your detailed question. It seems like you're on the right track with your approach to using a Knowledge Graph (KG) for your use case. To address your concerns:
For your use case, you might need to preprocess your data to ensure that it is in a format that can be easily parsed by the model. This might involve converting your Excel data into a structured format like JSON or CSV, and structuring your text in a way that makes it easier for the model to extract the relevant triplets. Here's a sample code snippet on how you can use LlamaIndex for your use case:

```python
from llama_index import (
    KnowledgeGraphIndex,
    ServiceContext,
    SimpleDirectoryReader,
)
from llama_index.storage.storage_context import StorageContext
from llama_index.graph_stores import NebulaGraphStore
from llama_index.llms import OpenAI
from llama_index.query_engine import KnowledgeGraphQueryEngine

# Placeholders -- replace with your own NebulaGraph schema and data path.
space_name = "llamaindex"
edge_types = ["relationship"]
rel_prop_names = ["relationship"]
tags = ["entity"]

# Load your (preprocessed) documents from a local directory.
documents = SimpleDirectoryReader("./data").load_data()

llm = OpenAI(temperature=0, model="text-davinci-002")
service_context = ServiceContext.from_defaults(llm=llm, chunk_size_limit=512)

graph_store = NebulaGraphStore(
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
)
storage_context = StorageContext.from_defaults(graph_store=graph_store)

# Build the knowledge graph index: the LLM extracts triplets from each chunk.
kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    max_triplets_per_chunk=10,
    service_context=service_context,
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
    include_embeddings=True,
)

# Query the graph in natural language.
query_engine = KnowledgeGraphQueryEngine(
    storage_context=storage_context,
    service_context=service_context,
    llm=llm,
    verbose=True,
)
response = query_engine.query("Your question here")
```

Please replace the placeholders with your actual data and parameters. I hope this helps! If you have any more questions or need further clarification, feel free to ask. Best,
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
-
🤖 Hello, Thank you for your detailed question. It seems like you're on the right track with using LlamaIndex's Knowledge Graph (KG) for your use case.

For your first two questions, as you mentioned, a simple vector store or even a traditional information retrieval system might suffice. However, for the last two questions, you're correct that a more complex system like a KG would be beneficial due to the need to retrieve multiple related chunks of information.

In terms of the issues you're facing with the KG, there might be some confusion about how the graph is generated and how triplets are extracted from the text. The graph is generated based on the triplets that are extracted from the text. Each triplet consists of a subject, predicate, and object, which form the nodes and edges of the graph. The extraction of triplets is handled by the LLM, which is prompted to pull up to `max_triplets_per_chunk` triplets out of each chunk, so the quality of the graph depends heavily on the quality of that extraction. There are a few limitations to this process that might be causing the issues you're seeing.

You can find more details about this process in the KnowledgeGraphIndex class. As for the image you've shared, I'm sorry but I'm unable to view images. Could you please describe the issue you're facing with the graph in more detail? I hope this helps! If you have any more questions or need further clarification, please don't hesitate to ask.
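To make the extraction step concrete, here is a toy rule-based extractor that returns (subject, predicate, object) tuples in the shape the index expects. This is only an illustration of the triplet format: LlamaIndex itself uses an LLM prompt for extraction, and the fixed predicate list below is an assumption for the example.

```python
import re
from typing import List, Tuple

Triplet = Tuple[str, str, str]

# Hypothetical predicates for illustration only; the real extraction is
# done by prompting an LLM, not by pattern matching.
PREDICATES = ["worked on", "belongs to", "covers"]

def extract_triplets(text: str) -> List[Triplet]:
    """Return (subject, predicate, object) tuples found in the text."""
    triplets: List[Triplet] = []
    for pred in PREDICATES:
        for m in re.finditer(rf"(\w[\w ]*?)\s+{pred}\s+(\w[\w ]*)", text):
            triplets.append((m.group(1).strip(), pred, m.group(2).strip()))
    return triplets

print(extract_triplets("author worked on programming"))
# -> [('author', 'worked on', 'programming')]
```

Whatever produces the triplets, the downstream graph treats the first and third elements as nodes and the second as the edge label, so malformed triplets at this stage show up directly as a malformed graph.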
-
@dosu-bot So for my points 3 & 4, is this the best approach? I have tried to manually add triplets to the graph (`node_0_tups = [...]`, the same code from llama-index I am using). But the generated graph has a relation like this: the two nodes are "author" and "worked on", and the arrow shows the relation as "author" too. What I have seen is that when inserting triplets, the first and second elements are taken as nodes and the first element is taken as the relation in all cases. Is that the correct method? In reality, shouldn't the first and third elements be taken as nodes and the second element as the relation? Also, what is the use of inserting nodes[0] here: index.upsert_triplet_and_node(tup, nodes[0])? For creating the KG?
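The (subject, predicate, object) convention the comment describes can be sanity-checked with a minimal in-memory sketch. This is an illustration of the convention, not LlamaIndex's internals:

```python
# Minimal in-memory graph following the (subject, predicate, object)
# convention: subject and object become nodes, predicate becomes the
# edge label. Illustration only, not LlamaIndex's implementation.
nodes = set()
edges = []  # (subject_node, edge_label, object_node)

def upsert_triplet(subj: str, pred: str, obj: str) -> None:
    nodes.update([subj, obj])
    edges.append((subj, pred, obj))

upsert_triplet("author", "worked on", "writing")
upsert_triplet("author", "worked on", "programming")

print(sorted(nodes))  # ['author', 'programming', 'writing']
print(edges[0][1])    # 'worked on'  <- the relation is the 2nd element
```

Under this convention, "worked on" should never appear as a node, which is why the rendered graph in the screenshot looks wrong.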
-
@vishnu9000 hmm, I think it's just a bug with drawing the graph tbh, especially since we have unit tests for upserting triplets like that.
-
@logan-markewich ohh.. so is this a bug with llama index? Any way to rectify it? If this is the case, then the same bug will be there for the other KG options in llama index too, right?
-
@vishnu9000 could you give it a try on NebulaGraph first? That's the main graph store I kept changing/iterating/optimizing towards. Will spend some time polishing the SimpleStore and the mutation of the kg_index later.
-
@wey-gu Thanks for the info, I will check it out. This might be a dumb question, but how do I create a graph from text for NebulaGraph? Should it be done manually, or should I use an LLM for extracting entities? Wouldn't that take too much time? I have 100 Excel files with test case id, component, test case name, steps, results. How do I create graph entities from this?
-
Question Validation
Question
Hi guys, I am new to llama index and LLMs. I am working on a use case for document question answering and I am confused about which path to take. I have a lot of Excel files that contain test case details like component, test case id, test case name, test case steps, expected results. I want to create a QA system that should be able to answer the following questions:
Here the first 2 questions can be done with a simple vector store. But the last two cannot be achieved by a vector store, due to the limitations of vector-store RAG, where chunks are saved and retrieved by similarity. Say I ask: give me the test cases for component y. The pipeline will retrieve the 5 best-matched chunks, but I may have 20 such test cases. And for the last question, the model might need to go to multiple chunks to get accurate results. In my case, each test case (component, test case id, test case name, test case steps, expected results) is a chunk, so each chunk itself already has a good amount of size.
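The top-k limitation described above is easy to see in a toy retriever: with a top-5 retriever, only 5 of the 20 matching test cases are ever returned. The crude word-overlap scoring below is a hypothetical stand-in for embedding similarity, not LlamaIndex code:

```python
# Toy illustration of the top-k limitation: 20 chunks match component "y",
# but a top-5 retriever returns only 5 of them. Hypothetical scoring
# function, not LlamaIndex code.
chunks = [f"test case {i} for component y" for i in range(20)]
chunks += [f"test case {i} for component x" for i in range(10)]

def score(chunk: str, query: str) -> int:
    # Crude overlap score: number of shared words.
    return len(set(chunk.split()) & set(query.split()))

query = "test cases for component y"
top_k = 5
ranked = sorted(chunks, key=lambda c: score(c, query), reverse=True)
retrieved = ranked[:top_k]

matching = [c for c in chunks if "component y" in c]
print(len(matching), len(retrieved))  # 20 matching, only 5 retrieved
```

Raising `top_k` helps only up to the context window; a structured store (KG or a metadata-filtered query) is the more robust fix when "all test cases for component y" must be returned exhaustively.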
That's when I thought of a Knowledge Graph and looked into the llama index framework. I am using a local Llama 2 quantized model for data security reasons. I have checked out the following sections.
I have tried to add triplets manually as in the docs:

```python
node_0_tups = [
    ("author", "worked on", "writing"),
    ("author", "worked on", "programming"),
]
for tup in node_0_tups:
    index.upsert_triplet_and_node(tup, nodes[0])
```
But this is the graph it's generating. I am really confused about how the relationships ended up like this.
I was hoping a KG could help with all the use cases I have listed above. Can someone help me with this? I am stuck. Am I doing anything wrong? Is this not the correct approach?