
Find most related text unit from entities #304

Closed
syf0119 opened this issue Nov 19, 2024 · 2 comments
Comments

@syf0119

syf0119 commented Nov 19, 2024

This is the code that finds the most related text units from entities; the method name is _find_most_related_text_unit_from_entities.
If relation_counts is meant to be the total number of times a text unit is referenced by neighboring nodes, why use continue? See lines 580-581.
Can someone explain this part of the code?
Thank you for your help.

[screenshot of the referenced code attached]

@aiproductguy
Contributor

It's creating an efficient index of text units while tracking how interconnected each unit is with other nodes in the graph. This section of the code:

  1. Builds a lookup dictionary (all_text_units_lookup) that contains each unique text unit and its metadata (see the sketch after this list)
  2. For each text unit, calculates:
  • How many one-hop neighbor nodes also reference it (relation_counts)
  • Its order of appearance (order)
  • The actual content data (data)
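
A minimal, self-contained sketch of that pattern follows. It is simplified and synchronous, and the parameter names (entity_text_units, entity_edges, neighbor_text_units, get_chunk) are hypothetical stand-ins for the real async storage calls, not the actual LightRAG signatures:

    # Simplified sketch of building all_text_units_lookup.
    # entity_text_units[i]: text-unit ids referenced by the i-th retrieved entity
    # entity_edges[i]: names of the i-th entity's one-hop neighbor nodes
    # neighbor_text_units: {neighbor_name: set of text-unit ids that neighbor references}
    # get_chunk: callable that returns the stored content for a text-unit id
    def build_text_unit_lookup(entity_text_units, entity_edges, neighbor_text_units, get_chunk):
        all_text_units_lookup = {}
        for order, (this_text_units, this_edges) in enumerate(zip(entity_text_units, entity_edges)):
            for c_id in this_text_units:
                if c_id in all_text_units_lookup:
                    continue  # already recorded via an earlier (higher-ranked) entity
                # count how many one-hop neighbors also reference this text unit
                relation_counts = sum(
                    1 for nb in this_edges if c_id in neighbor_text_units.get(nb, set())
                )
                all_text_units_lookup[c_id] = {
                    "data": get_chunk(c_id),            # the chunk content
                    "order": order,                     # rank of the first entity referencing it
                    "relation_counts": relation_counts,
                }
        return all_text_units_lookup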

#asked/claude-3-5-sonnet-2410
The continue statement on lines 580-581 is used to avoid processing the same text unit (c_id) multiple times. The reason for this is:

  1. The same text unit (c_id) might appear in multiple entities' source IDs (this_text_units)
  2. We only want to process each unique text unit once and store its metadata in all_text_units_lookup
  3. When we find a text unit we've already processed (it exists in all_text_units_lookup), we skip it with continue

For example:

  • Entity A might reference text units [1, 2, 3]
  • Entity B might reference text units [2, 3, 4]
  • Without the continue, text units 2 and 3 would be processed twice
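
To make that concrete, here is a tiny hypothetical run of just the deduplication logic with the Entity A / Entity B example above:

    entity_text_units = [
        ["1", "2", "3"],  # Entity A, order 0 (ranked higher by the vector search)
        ["2", "3", "4"],  # Entity B, order 1
    ]
    lookup = {}
    for order, text_units in enumerate(entity_text_units):
        for c_id in text_units:
            if c_id in lookup:
                continue  # "2" and "3" hit this on the second pass and are skipped
            lookup[c_id] = {"order": order}
    print(lookup)
    # {'1': {'order': 0}, '2': {'order': 0}, '3': {'order': 0}, '4': {'order': 1}}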

The relation_counts calculation is still accurate because:

  • It counts how many one-hop neighbor nodes also reference this text unit
  • This count only needs to be calculated once per unique text unit
  • The order/index of when we first encounter the text unit is preserved in the order field

So while we want to count all relationships, we only need to process each unique text unit once and store its metadata (data, order, relation_counts) in the lookup dictionary.

@syf0119
Author

syf0119 commented Nov 26, 2024

Thank you very much for your answer; I now understand this piece of code.

  • The order is the position of the node (entity) in the results of the vector database similarity query:

    results = await entities_vdb.query(query, top_k=query_param.top_k)

    if not len(results):
        return None
    node_datas = await asyncio.gather(
        *[knowledge_graph_inst.get_node(r["entity_name"]) for r in results]
    )

  • The order is the highest-weighted sort key, so a text unit (c_id) only needs to be counted for the top-ranked node (entity) in which it first appears:

    all_text_units = sorted(
        all_text_units, key=lambda x: (x["order"], -x["relation_counts"])
    )
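
As a quick illustration of that sort key (hypothetical values): a lower order wins first, and within the same order a higher relation_counts comes first:

    all_text_units = [
        {"id": "4", "order": 1, "relation_counts": 5},
        {"id": "1", "order": 0, "relation_counts": 0},
        {"id": "2", "order": 0, "relation_counts": 3},
    ]
    all_text_units = sorted(
        all_text_units, key=lambda x: (x["order"], -x["relation_counts"])
    )
    print([u["id"] for u in all_text_units])
    # ['2', '1', '4']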

@syf0119 syf0119 closed this as completed Nov 26, 2024