🤖 Hello,

Thank you for your detailed question. It's great to hear that you've been able to successfully index and query your documents using LlamaIndex.

To prioritize your "fact sheet" documents over other PDF documents, you can modify the `_add_nodes_to_index` method so that each document summary carries a weight you can use at retrieval time. Here is a hypothetical example of how you might modify it:

```python
def _add_nodes_to_index(
    self,
    index_struct: IndexDocumentSummary,
    nodes: Sequence[BaseNode],
    show_progress: bool = False,
) -> None:
    """Add nodes to index."""
    doc_id_to_nodes = defaultdict(list)
    for node in nodes:
        if node.ref_doc_id is None:
            raise ValueError(
                "ref_doc_id of node cannot be None when building a document "
                "summary index"
            )
        doc_id_to_nodes[node.ref_doc_id].append(node)

    summary_node_dict = {}
    items = doc_id_to_nodes.items()
    iterable_with_progress = get_tqdm_iterable(
        items, show_progress, "Summarizing documents"
    )
    for doc_id, nodes in iterable_with_progress:
        print(f"current doc id: {doc_id}")
        nodes_with_scores = [NodeWithScore(node=n) for n in nodes]
        # get the summary for each doc_id
        summary_response = self._response_synthesizer.synthesize(
            query=self._summary_query,
            nodes=nodes_with_scores,
        )
        summary_response = cast(Response, summary_response)
        summary_node_dict[doc_id] = TextNode(
            text=summary_response.response,
            relationships={
                NodeRelationship.SOURCE: RelatedNodeInfo(node_id=doc_id)
            },
        )

        # Prioritize "fact sheet" documents. TextNode has no built-in
        # weight field (assigning an unknown attribute to the pydantic
        # model would raise an error), so store the weight in metadata,
        # where a custom retriever or node postprocessor can read it.
        if is_fact_sheet(doc_id):  # You would need to implement this function
            summary_node_dict[doc_id].metadata["weight"] = 2.0  # higher weight
        else:
            summary_node_dict[doc_id].metadata["weight"] = 1.0  # normal weight

        self.docstore.add_documents([summary_node_dict[doc_id]])
        logger.info(
            f"> Generated summary for doc {doc_id}: "
            f"{summary_response.response}"
        )

    for doc_id, nodes in doc_id_to_nodes.items():
        index_struct.add_summary_and_nodes(summary_node_dict[doc_id], nodes)

    if self._embed_summaries:
        embed_model = self._service_context.embed_model
        summary_nodes = list(summary_node_dict.values())
        id_to_embed_map = embed_nodes(
            summary_nodes, embed_model, show_progress=show_progress
        )
        summary_nodes_with_embedding = []
        for node in summary_nodes:
            node_with_embedding = node.copy()
            node_with_embedding.embedding = id_to_embed_map[node.node_id]
            summary_nodes_with_embedding.append(node_with_embedding)
        self._vector_store.add(summary_nodes_with_embedding)
```

Please note that this is a hypothetical example and might not work as-is. You would need to adapt it to your specific use case and data.

I hope this helps! If you have any further questions, please don't hesitate to ask.

Best,
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot. |
-
Question
In my setup, several PDF documents pertain to a legal case. We have hundreds of cases, resulting in thousands of PDF documents.
I have successfully indexed each document, assigned the right metadata (such as the case number), and queried the documents using filters.
To improve response accuracy, I have built a custom text document for each case, called a "fact sheet", which contains the key details of that case. We know that this "fact sheet" document contains the answers to many of the questions being asked.
So far, I have been able to add this "fact sheet" document for each legal case and index it. However, I am noticing that it is often passed over when questions are asked, even though it contains the answer.
How could I give my "fact sheet" documents a slight priority over the other PDF documents related to that case?
I have looked at Composability, but I don't think that's quite what I need, as the fact sheet does not necessarily contain a written summary of the case; instead, it simply holds a lot of key facts about the case.
Thanks in advance.