How to build a retriever that filters documents by metadata #9536

aiwalter · 2023-12-14T23:25:47Z


I have a query engine with, for example, similarity_top_k=20. Some of my documents share the same doc.metadata["file_name"]. I want to build a retriever that, when several of the top 20 documents share a file_name, keeps at most 5 documents with the same metadata["file_name"] and fills the remaining slots with documents that have a different file_name.


@aiwalter · 2023-12-14T23:33:39Z

dosubot[bot]
bot Dec 14, 2023

🤖

Hey there @aiwalter! Good to see you back in action here, I hope you're doing well. Let's delve into this new challenge you've got for us!

Based on your requirements, you can modify the _simple_fusion method in the QueryFusionRetriever class to limit the number of documents with the same metadata["file_name"] to a maximum of 5 among the top 20 results returned by your query engine.

Here's how you can do it:

def _simple_fusion(
    self, results: Dict[Tuple[str, int], List[NodeWithScore]]
) -> List[NodeWithScore]:
    """Apply simple fusion, keeping at most 5 nodes per file_name."""
    # Use a dict to de-duplicate nodes
    all_nodes: Dict[str, NodeWithScore] = {}
    for nodes_with_scores in results.values():
        for node_with_score in nodes_with_scores:
            text = node_with_score.node.get_content()
            all_nodes[text] = node_with_score

    # Rank first, so the cap drops the lowest-scoring nodes of each file
    ranked = sorted(all_nodes.values(), key=lambda x: x.score or 0.0, reverse=True)
    file_name_counts: Dict[str, int] = {}
    kept: List[NodeWithScore] = []
    for node_with_score in ranked:
        file_name = node_with_score.node.metadata["file_name"]
        if file_name_counts.get(file_name, 0) < 5:
            kept.append(node_with_score)
            file_name_counts[file_name] = file_name_counts.get(file_name, 0) + 1

    return kept[:20]

In this modification, the nodes are first de-duplicated and ranked by score. We then keep track of the count of each file_name in the file_name_counts dictionary and keep a node only if fewer than 5 nodes with the same file_name have already been kept. Because the nodes are visited in descending score order, the ones dropped are the lowest-scoring nodes of that file, and a duplicate node is counted only once. Finally, we return at most 20 results.
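The capping rule can be tried in isolation, without LlamaIndex. The following is a minimal sketch, not library code: cap_per_group and the dictionary-shaped hits are hypothetical stand-ins for NodeWithScore objects, and the defaults of 5 and 20 come from the question.

```python
from collections import defaultdict
from typing import Callable, Hashable, Iterable, List, TypeVar

T = TypeVar("T")


def cap_per_group(
    items: Iterable[T],
    group_of: Callable[[T], Hashable],
    score_of: Callable[[T], float],
    max_per_group: int = 5,
    top_k: int = 20,
) -> List[T]:
    """Return up to top_k items by descending score, at most max_per_group per group."""
    counts = defaultdict(int)
    kept: List[T] = []
    # Rank first, so the cap discards the lowest-scoring items of a crowded group.
    for item in sorted(items, key=score_of, reverse=True):
        if len(kept) >= top_k:
            break
        group = group_of(item)
        if counts[group] < max_per_group:
            kept.append(item)
            counts[group] += 1
    return kept


# Hypothetical hits: 8 chunks from a.pdf outrank 4 chunks from b.pdf.
hits = [{"file_name": "a.pdf", "score": 0.9 - 0.01 * i} for i in range(8)]
hits += [{"file_name": "b.pdf", "score": 0.5 - 0.01 * i} for i in range(4)]

result = cap_per_group(hits, lambda h: h["file_name"], lambda h: h["score"])
print([h["file_name"] for h in result])
```

With this input the five best a.pdf chunks are kept, the other three are dropped, and all four b.pdf chunks follow, for nine results in total.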

Please note that this modification assumes that the metadata of each NodeWithScore contains a file_name key. If the key is different, please adjust the code accordingly.
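If some nodes may lack the key, a small helper can make the lookup tolerant instead of raising a KeyError. This is only a sketch: group_key is a hypothetical name, and the fallback to a "file_path" key is an assumption about how your documents were loaded, so adjust the key list to your own metadata.

```python
from typing import Mapping, Optional, Sequence


def group_key(
    metadata: Mapping[str, object],
    keys: Sequence[str] = ("file_name", "file_path"),  # assumed key names
) -> Optional[str]:
    """Return the first non-empty metadata value among keys, or None if none is set."""
    for key in keys:
        value = metadata.get(key)
        if value:
            return str(value)
    return None


print(group_key({"file_name": "a.pdf"}))
print(group_key({"page": 3}))
```

A return value of None leaves the decision to the caller, for example to treat such nodes as uncapped or to put them all in one shared group.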

Regarding the structure of the NodeWithScore object, I wasn't able to find specific information within the LlamaIndex repository. However, based on the usage in the code, it seems to be an object that contains a node and a score. The node itself appears to have methods for getting its content (get_content()) and metadata (metadata). If you need more specific information about this object, could you please provide more details about what you're looking for?

You can find the original _simple_fusion method in the QueryFusionRetriever class in the fusion_retriever.py file.

I hope this helps! If you have any other questions or need further clarification, feel free to ask.

Sources

llama_index/retrievers/fusion_retriever.py

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

2 replies

aiwalter Dec 15, 2023
Author

@dosu-bot But I don't want to do a fusion query. Can you adjust the BaseRetriever to do what you described in the _simple_fusion method?

dosubot[bot] bot Dec 15, 2023

🤖

Sure, I can guide you on how to modify the BaseRetriever to implement the functionality you described.

First, you need to override the _retrieve method in your subclass of BaseRetriever. This method should return a list of NodeWithScore objects. Each NodeWithScore object pairs a node with a score. The node (a BaseNode subclass such as TextNode in recent versions) has a metadata attribute that is a dictionary. The score is a float that represents the relevance of the node to the query, and it may be None.

Here is a simplified version of how you can implement this:

from collections import defaultdict
from typing import Dict, List, Set

# Import paths for llama-index 0.10 and later. On 0.9.x, use
# `from llama_index.retrievers import BaseRetriever` and
# `from llama_index.schema import NodeWithScore` instead.
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import NodeWithScore

class CustomRetriever(BaseRetriever):
    def _retrieve(self, query_bundle):
        nodes_with_scores = self._get_nodes_with_scores(query_bundle)

        # Apply the per-file cap
        return self._simple_fusion(nodes_with_scores)

    def _get_nodes_with_scores(self, query_bundle) -> List[NodeWithScore]:
        # This is a placeholder. Replace this with your actual retrieval logic.
        raise NotImplementedError

    def _simple_fusion(self, nodes_with_scores: List[NodeWithScore]) -> List[NodeWithScore]:
        # Rank first, so the cap drops the lowest-scoring nodes of each file
        ranked = sorted(nodes_with_scores, key=lambda x: x.score or 0.0, reverse=True)
        seen_texts: Set[str] = set()
        file_name_counts: Dict[str, int] = defaultdict(int)
        kept: List[NodeWithScore] = []

        for node_with_score in ranked:
            text = node_with_score.node.get_content()
            file_name = node_with_score.node.metadata["file_name"]
            if text not in seen_texts and file_name_counts[file_name] < 5:
                seen_texts.add(text)
                kept.append(node_with_score)
                file_name_counts[file_name] += 1

        return kept[:20]

In this code, _retrieve is the method that you need to implement to retrieve nodes based on the query. The _simple_fusion method applies the logic you described. It ranks the nodes by score, skips nodes whose content it has already seen, and keeps a node only if fewer than 5 nodes with the same file_name have been kept so far. This ensures that no more than 5 documents with the same file_name are included in the results, and that the ones dropped are the lowest-scoring ones of that file. Finally, it returns at most 20 results. Note that the cap can only remove nodes: if the retrieval step returns exactly 20 candidates, the result may contain fewer than 20.

Please replace the body of the _get_nodes_with_scores method with your actual retrieval logic, for example a call to the retrieve method of an existing retriever. This method should return a list of NodeWithScore objects based on the query.
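To let chunks from other files fill the slots freed by the cap, as the question asks, the candidate pool has to be larger than 20. A minimal sketch of one way to do this, using plain tuples instead of NodeWithScore: retrieve_capped, fetch, and max_fetch are hypothetical names, and fetch(k) stands for whatever returns the k best candidates (with a vector index, presumably a retriever built with a larger similarity_top_k).

```python
from collections import Counter
from typing import Callable, List, Tuple

Hit = Tuple[str, float]  # (file_name, score); a stand-in for NodeWithScore


def retrieve_capped(
    fetch: Callable[[int], List[Hit]],
    top_k: int = 20,
    max_per_file: int = 5,
    max_fetch: int = 320,
) -> List[Hit]:
    """Widen the candidate pool until top_k capped hits are found or nothing is left."""
    k = top_k
    while True:
        candidates = sorted(fetch(k), key=lambda h: h[1], reverse=True)
        counts: Counter = Counter()
        kept: List[Hit] = []
        for hit in candidates:
            if len(kept) == top_k:
                break
            if counts[hit[0]] < max_per_file:
                kept.append(hit)
                counts[hit[0]] += 1
        # Stop when full, when the source had fewer than k candidates, or at the budget.
        if len(kept) == top_k or len(candidates) < k or k >= max_fetch:
            return kept
        k = min(k * 2, max_fetch)


# Hypothetical corpus: 30 high-scoring chunks from a.pdf, 10 each from three other files.
corpus: List[Hit] = []
for rank, name in enumerate(["a.pdf", "b.pdf", "c.pdf", "d.pdf"]):
    size = 30 if name == "a.pdf" else 10
    corpus += [(name, 0.9 - 0.1 * rank - 0.001 * i) for i in range(size)]

calls: List[int] = []


def fetch(k: int) -> List[Hit]:
    calls.append(k)
    return corpus[:k]  # corpus is already in descending score order


result = retrieve_capped(fetch)
print(calls)
print(Counter(name for name, _ in result))
```

In this toy run the pool is widened from 20 to 40 to 80 candidates before 20 capped results are available. Doubling keeps the number of retrieval calls small; fetching a fixed multiple of top_k once is a simpler alternative when the corpus is known to be diverse.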

I hope this helps! If you have any further questions, feel free to ask.


Answer selected by aiwalter
