🤖 Hello,

Thank you for your detailed question. It's great to hear that you've been able to successfully index and query your documents using LlamaIndex.

To prioritize your "fact sheet" documents over other PDF documents, you can modify the `_add_nodes_to_index` method so that each document summary carries a weight you can use at retrieval time. Here is a hypothetical example of how you might modify it:

```python
def _add_nodes_to_index(
    self,
    index_struct: IndexDocumentSummary,
    nodes: Sequence[BaseNode],
    show_progress: bool = False,
) -> None:
    """Add nodes to index."""
    doc_id_to_nodes = defaultdict(list)
    for node in nodes:
        if node.ref_doc_id is None:
            raise ValueError(
                "ref_doc_id of node cannot be None when building a document "
                "summary index"
            )
        doc_id_to_nodes[node.ref_doc_id].append(node)

    summary_node_dict = {}
    items = doc_id_to_nodes.items()
    iterable_with_progress = get_tqdm_iterable(
        items, show_progress, "Summarizing documents"
    )
    for doc_id, nodes in iterable_with_progress:
        print(f"current doc id: {doc_id}")
        nodes_with_scores = [NodeWithScore(node=n) for n in nodes]
        # get the summary for each doc_id
        summary_response = self._response_synthesizer.synthesize(
            query=self._summary_query,
            nodes=nodes_with_scores,
        )
        summary_response = cast(Response, summary_response)
        summary_node_dict[doc_id] = TextNode(
            text=summary_response.response,
            relationships={
                NodeRelationship.SOURCE: RelatedNodeInfo(node_id=doc_id)
            },
        )

        # Prioritize "fact sheet" documents. TextNode has no built-in
        # weight field (assigning an unknown attribute to the pydantic
        # model would raise an error), so store the weight in metadata,
        # where a custom retriever or node postprocessor can read it.
        if is_fact_sheet(doc_id):  # You would need to implement this function
            summary_node_dict[doc_id].metadata["weight"] = 2.0  # higher weight
        else:
            summary_node_dict[doc_id].metadata["weight"] = 1.0  # normal weight

        self.docstore.add_documents([summary_node_dict[doc_id]])
        logger.info(
            f"> Generated summary for doc {doc_id}: "
            f"{summary_response.response}"
        )

    for doc_id, nodes in doc_id_to_nodes.items():
        index_struct.add_summary_and_nodes(summary_node_dict[doc_id], nodes)

    if self._embed_summaries:
        embed_model = self._service_context.embed_model
        summary_nodes = list(summary_node_dict.values())
        id_to_embed_map = embed_nodes(
            summary_nodes, embed_model, show_progress=show_progress
        )
        summary_nodes_with_embedding = []
        for node in summary_nodes:
            node_with_embedding = node.copy()
            node_with_embedding.embedding = id_to_embed_map[node.node_id]
            summary_nodes_with_embedding.append(node_with_embedding)
        self._vector_store.add(summary_nodes_with_embedding)
```

Please note that this is a hypothetical example and might not work as-is. You would need to adapt it to your specific use case and data.

I hope this helps! If you have any further questions, please don't hesitate to ask.

Best,
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot. |
-
Question
In my setup, several PDF documents pertain to a legal case. We have hundreds of cases, resulting in thousands of PDF documents.
I have successfully indexed each document, assigned the right metadata (such as the case number), and queried the documents using filters.
To improve response accuracy, I have built a custom text document for each case, called a "fact sheet", which contains the key details of that case. We know that this "fact sheet" document contains the answers to many of the questions being asked.
So far, I have been able to add this "fact sheet" document for each legal case and index it. However, I am noticing that it is often passed over when questions are asked, even though it contains the answer.
How could I give my "fact sheet" documents a slight priority over the other PDF documents related to that case?
I have looked at Composability, but I don't think that's quite what I need, as the fact sheet does not necessarily contain a written summary of the case; instead, it simply holds a lot of key facts about the case.
Thanks in advance.