[Question]: How can I improve responses? #8377
Replies: 6 comments
-
I think a custom retriever will work best here. I'm imagining something that can use the metadata or node relationships to pull in the most relevant nodes. There's definitely quite a range of query types here: some may need keywords, some may need to read the entire index, and some may work fine with vector search. We actually JUST released a retriever router, which you may find useful: https://github.com/jerryjliu/llama_index/blob/main/docs/examples/retrievers/router_retriever.ipynb https://gpt-index.readthedocs.io/en/stable/core_modules/query_modules/retriever/root.html
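The routing idea behind the linked RouterRetriever can be sketched in plain Python: a selector inspects each query and dispatches to the retrieval strategy that fits it. Everything below (`SimpleRouter`, `keyword_retrieve`, `vector_retrieve`, the month-based selector) is a hypothetical, self-contained stand-in, not the actual llama_index API — the real library routes via an LLM or Pydantic selector over retriever tools.

```python
def keyword_retrieve(query, docs):
    """Toy keyword search: keep docs sharing a non-trivial query term."""
    terms = {t for t in query.lower().split() if len(t) > 3}
    return [d for d in docs if terms & set(d.lower().split())]

def vector_retrieve(query, docs):
    """Stand-in for embedding similarity: rank by word overlap, keep top 2."""
    terms = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(terms & set(d.lower().split())))[:2]

class SimpleRouter:
    """Route each query to one of several retrievers via a selector."""
    def __init__(self, retrievers, selector):
        self.retrievers = retrievers
        self.selector = selector

    def retrieve(self, query, docs):
        return self.retrievers[self.selector(query)](query, docs)

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")

def date_selector(query):
    """Trivial rule: date-flavored queries go to keyword search."""
    return "keyword" if any(m in query for m in MONTHS) else "vector"

router = SimpleRouter({"keyword": keyword_retrieve,
                       "vector": vector_retrieve}, date_selector)
```

In the real router the selector is itself an LLM choosing among tool descriptions, so the routing rule is learned from the descriptions rather than hardcoded like `date_selector` here.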
-
@logan-markewich Thank you, this is great. I watched this https://www.youtube.com/watch?v=njzB6fm0U8g a few days ago and would love to know your thoughts on which approach to pursue first. My initial thought was that using Docugami's knowledge graph structure and attached metadata would probably do it, but a custom retriever like this seems more applicable for handling queries that require disparate information. If you were me, would you start by creating a retriever and then pair it with a metadata approach?
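The "pair a retriever with a metadata approach" idea can be sketched as two toy retrievers whose results get merged: one matching on a metadata field, one on text, combined by union or intersection. The names and the dict-based document shape here are hypothetical stand-ins; the real LlamaIndex pattern for this subclasses `BaseRetriever`.

```python
def metadata_retrieve(query, docs):
    """Toy metadata match: the doc's 'source' label appears in the query."""
    return [d for d in docs if d["source"] in query.lower()]

def text_retrieve(query, docs):
    """Toy text match: any word overlap between query and doc text."""
    terms = set(query.lower().split())
    return [d for d in docs if terms & set(d["text"].lower().split())]

def combined_retrieve(query, docs, mode="union"):
    """Merge both retrievers: 'union' keeps any hit,
    'intersection' keeps only docs both retrievers returned."""
    a = metadata_retrieve(query, docs)
    b = text_retrieve(query, docs)
    if mode == "union":
        return [d for d in docs if d in a or d in b]
    return [d for d in docs if d in a and d in b]
```

Intersection mode is the stricter choice when the metadata signal (e.g. a file's month label) should gate which chunks the text match is even allowed to return.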
-
Implemented the router and had to use a prompt to force the selector to use the keyword retriever (aside: PydanticMultiSelector would not choose multiple retrievers, but LLMMultiSelector did). Big improvement, but in testing I'm still not fetching many relevant nodes. I'm testing this question specifically: "What were some high points from the February 2023 Monthly Update?" but only getting one node back containing the word "February" while using LLMRerank. I considered a KeywordNodePostprocessor, but that doesn't seem flexible if you have to specify words for each unique query. I can see in my logs that there are two identical nodes whose text says "In our February update , we reported that..." but these nodes are not being chosen by LLMRerank.
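The inflexibility concern about per-query `required_keywords` could be addressed by deriving the keywords from the query itself instead of hardcoding them. Below is an illustrative heuristic (month names and 4-digit years via regex), not a LlamaIndex component — a real version might plug into a custom node postprocessor.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

def required_keywords(query):
    """Pull month names and 4-digit years out of the query."""
    return re.findall(rf"(?:{MONTHS})|\b\d{{4}}\b", query)

def keyword_filter(nodes, query):
    """Keep nodes mentioning at least one extracted keyword;
    pass everything through when no keywords were found."""
    kws = required_keywords(query)
    if not kws:
        return nodes
    return [n for n in nodes if any(k.lower() in n.lower() for k in kws)]
```

Running this before (or instead of) LLMRerank would guarantee that both "In our February update..." nodes survive to the final context, since the filter is deterministic rather than left to the reranking LLM's judgment.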
-
It appears that "February" is not part of my extracted keyword list, which I believe is the problem. I'm attempting to use a prompt template in my keyword_retriever, but I'm having trouble enforcing its purpose.
I specifically said "extract the word February" and it didn't work. Very strange.
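When an LLM keyword extractor keeps dropping tokens like "February" no matter how the prompt is worded, one workaround is to union its output with a deterministic pass, so date-like tokens always survive regardless of the prompt. This is a sketch of that idea; `extract_with_llm` is a hypothetical stand-in for the LLM call, not a real API.

```python
import re

def deterministic_keywords(text):
    """Capitalized words and 4-digit years always become keywords."""
    return set(re.findall(r"\b[A-Z][a-z]+\b|\b\d{4}\b", text))

def merged_keywords(text, extract_with_llm):
    """Union LLM-extracted keywords with the deterministic set."""
    return set(extract_with_llm(text)) | deterministic_keywords(text)
```

The capitalization heuristic will also sweep in sentence-initial words, which is usually harmless for a keyword table (a few extra entries) and much cheaper than fighting the prompt.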
-
🤖 Hello, Thank you for reaching out and providing a detailed description of your issue. It seems like you're dealing with a complex problem related to querying across multiple documents and getting accurate responses. I'll try to provide some guidance based on your current plan and the code you've shared. Firstly, your plan to experiment with different settings and models is a good approach. Here are some additional suggestions based on your plan:
In addition to these, you might want to consider the following:
I hope these suggestions help you improve the response quality of your LlamaIndex application. Please let me know if you have any other questions or need further clarification.
-
🤖 Hello, Thank you for reaching out and providing a detailed description of your issue. I understand that you're experiencing problems with the response quality when querying across documents using LlamaIndex. I'll try to provide some guidance based on your current situation and the code you've shared. Firstly, your plan of action seems to be on the right track. Here are some additional suggestions based on your points:
In addition to these points, based on the solved issues in the LlamaIndex repository, you might want to consider the following:
I hope these suggestions help you improve the performance of your LlamaIndex application. Please let me know if you have any further questions or if there's anything else I can assist you with.
-
Question
I'm running into issues with the response quality I'm getting when querying across documents. I'm currently testing two batches of PDFs, one 11MB in size and the other 72MB. When I reference specific file names, such as "October 2022 update," I'm getting answers with information from the wrong file. I also cannot get correct answers to questions like "when was the last time we said something about x?" or "what date did we send xyz?" They are not all date-related issues, however. Another example: if I say "what have we said about y?" I receive a description of the wrong entity. Just giving examples for color.
Can someone help me understand the forces at play here? I know it's probably not possible to achieve correct answers 100% of the time, but my performance as of now is poor. I believe my index needs to be more robust to handle the complex queries I'm asking across many documents. Here is my current plan of things to try:
An additional idea I had was to use different indices/composability as I scale up the knowledge base, but that doesn't apply directly to the problem at hand, since these issues are happening across a relatively small number of files.
Are there other things I should try to test or incorporate? I'm a beginner but eager to learn more. Thank you.
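One concrete lever for the "wrong file" failures described above: attach the source file name to every chunk at ingest time and filter on it before semantic search, so an "October 2022 update" question never sees chunks from other files. The field names and `Chunk` class below are illustrative only; LlamaIndex nodes carry a similar `metadata` dict populated by its file readers.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def filter_by_metadata(chunks, **required):
    """Keep chunks whose metadata matches every required key/value;
    with no requirements given, all chunks pass through."""
    return [c for c in chunks
            if all(c.metadata.get(k) == v for k, v in required.items())]
```

The hard part then shifts from retrieval to query understanding: mapping "October 2022 update" in the question to the right `file_name` value, which is exactly the kind of job a selector or custom retriever can do.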