
Find most related text unit from entities #304

Closed
syf0119 opened this issue Nov 19, 2024 · 2 comments
Comments

@syf0119

syf0119 commented Nov 19, 2024

This is the code that finds the most related text units from entities; the method name is _find_most_related_text_unit_from_entities.
If relation_counts is meant to be the total number of times a text unit is referenced by neighboring nodes, why use continue? See lines 580-581.
Can someone explain this part of the code?
Thank you for your help.

[screenshot of the referenced code attached]

@aiproductguy
Contributor

It's creating an efficient index of text units while tracking how interconnected each unit is with other nodes in the graph. This section of the code:

  1. Builds a lookup dictionary (all_text_units_lookup) that contains each unique text unit and its metadata (see the sketch after this list)
  2. For each text unit, calculates:
  • How many one-hop neighbor nodes also reference it (relation_counts)
  • Its order of appearance (order)
  • The actual content data (data)
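
A minimal, self-contained sketch of that pattern follows. It is simplified and synchronous, and the parameter names (entity_text_units, entity_edges, neighbor_text_units, get_chunk) are hypothetical stand-ins for the real async storage calls, not the actual LightRAG signatures:

    # Simplified sketch of building all_text_units_lookup.
    # entity_text_units[i]: text-unit ids referenced by the i-th retrieved entity
    # entity_edges[i]: names of the i-th entity's one-hop neighbor nodes
    # neighbor_text_units: {neighbor_name: set of text-unit ids that neighbor references}
    # get_chunk: callable that returns the stored content for a text-unit id
    def build_text_unit_lookup(entity_text_units, entity_edges, neighbor_text_units, get_chunk):
        all_text_units_lookup = {}
        for order, (this_text_units, this_edges) in enumerate(zip(entity_text_units, entity_edges)):
            for c_id in this_text_units:
                if c_id in all_text_units_lookup:
                    continue  # already recorded via an earlier (higher-ranked) entity
                # count how many one-hop neighbors also reference this text unit
                relation_counts = sum(
                    1 for nb in this_edges if c_id in neighbor_text_units.get(nb, set())
                )
                all_text_units_lookup[c_id] = {
                    "data": get_chunk(c_id),            # the chunk content
                    "order": order,                     # rank of the first entity referencing it
                    "relation_counts": relation_counts,
                }
        return all_text_units_lookup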

#asked/claude-3-5-sonnet-2410
The continue statement on lines 580-581 is used to avoid processing the same text unit (c_id) multiple times. The reason for this is:

  1. The same text unit (c_id) might appear in multiple entities' source IDs (this_text_units)
  2. We only want to process each unique text unit once and store its metadata in all_text_units_lookup
  3. When we find a text unit we've already processed (it exists in all_text_units_lookup), we skip it with continue

For example:

  • Entity A might reference text units [1, 2, 3]
  • Entity B might reference text units [2, 3, 4]
  • Without the continue, text units 2 and 3 would be processed twice
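
To make that concrete, here is a tiny hypothetical run of just the deduplication logic with the Entity A / Entity B example above:

    entity_text_units = [
        ["1", "2", "3"],  # Entity A, order 0 (ranked higher by the vector search)
        ["2", "3", "4"],  # Entity B, order 1
    ]
    lookup = {}
    for order, text_units in enumerate(entity_text_units):
        for c_id in text_units:
            if c_id in lookup:
                continue  # "2" and "3" hit this on the second pass and are skipped
            lookup[c_id] = {"order": order}
    print(lookup)
    # {'1': {'order': 0}, '2': {'order': 0}, '3': {'order': 0}, '4': {'order': 1}}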

The relation_counts calculation is still accurate because:

  • It counts how many one-hop neighbor nodes also reference this text unit
  • This count only needs to be calculated once per unique text unit
  • The order/index of when we first encounter the text unit is preserved in the order field

So while we want to count all relationships, we only need to process each unique text unit once and store its metadata (data, order, relation_counts) in the lookup dictionary.

@syf0119
Author

syf0119 commented Nov 26, 2024

Thank you very much for your answer; I now understand this piece of code.

  • The order is the position of the node (entity) in the results of the vector database similarity query:

    results = await entities_vdb.query(query, top_k=query_param.top_k)

    if not len(results):
        return None
    node_datas = await asyncio.gather(
        *[knowledge_graph_inst.get_node(r["entity_name"]) for r in results]
    )

  • The order is the highest-weighted sort key, so a text unit (c_id) only needs to be counted for the top-ranked node (entity) in which it first appears:

    all_text_units = sorted(
        all_text_units, key=lambda x: (x["order"], -x["relation_counts"])
    )
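
As a quick illustration of that sort key (hypothetical values): a lower order wins first, and within the same order a higher relation_counts comes first:

    all_text_units = [
        {"id": "4", "order": 1, "relation_counts": 5},
        {"id": "1", "order": 0, "relation_counts": 0},
        {"id": "2", "order": 0, "relation_counts": 3},
    ]
    all_text_units = sorted(
        all_text_units, key=lambda x: (x["order"], -x["relation_counts"])
    )
    print([u["id"] for u in all_text_units])
    # ['2', '1', '4']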

@syf0119 syf0119 closed this as completed Nov 26, 2024