IndexGraph is redundant with node_relation #8164

azurewtl · 2023-10-17T15:40:01Z

After skimming through the TreeIndex code, I think the idea is brilliant. However, I am a bit confused by the current implementation.
To keep track of the tree structure, TreeIndex uses a dict, which seems to serve the same function as the node relationships, and the latter are more intuitive in my opinion.
Additionally, the current implementation of GPTTreeIndexBuilder merely merges the input nodes/documents until their number hits the num_children parameter, regardless of their metadata.
I think nodes should be merged primarily based on the raw document file path, so that nodes that actually come from the same file end up under the same parent.
I am thinking about reimplementing the tree index behavior using raw_documents. The splitter should take each document as a whole and split it based on its path hierarchy and paragraphs. A tree structure should be generated while the document is being split, to preserve the natural knowledge structure of the folders.
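A minimal sketch of the path-based grouping described above, written in plain Python rather than against the real llama_index classes. `SimpleNode`, `build_path_tree`, and `ROOT_ID` are hypothetical names used only for illustration, and the sketch assumes relative, POSIX-style file paths:

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Dict, List, Optional, Tuple

ROOT_ID = "<root>"  # hypothetical id for the synthetic top-level node


@dataclass
class SimpleNode:
    """Hypothetical stand-in for a node; NOT the llama_index class."""
    node_id: str
    text: str = ""
    parent_id: Optional[str] = None
    child_ids: List[str] = field(default_factory=list)


def build_path_tree(chunks: List[Tuple[str, str]]) -> Dict[str, SimpleNode]:
    """Build a tree whose inner nodes mirror folders and files.

    `chunks` is a list of (relative_file_path, chunk_text) pairs.
    Chunks from the same file always share the same parent node.
    """
    nodes: Dict[str, SimpleNode] = {ROOT_ID: SimpleNode(node_id=ROOT_ID)}

    def ensure(path_id: str, parent_id: str) -> None:
        # Create the folder/file node once and link it to its parent.
        if path_id not in nodes:
            nodes[path_id] = SimpleNode(node_id=path_id, parent_id=parent_id)
            nodes[parent_id].child_ids.append(path_id)

    for index, (file_path, text) in enumerate(chunks):
        parent_id = ROOT_ID
        parts = PurePosixPath(file_path).parts
        for depth in range(len(parts)):
            path_id = "/".join(parts[: depth + 1])
            ensure(path_id, parent_id)
            parent_id = path_id
        # The chunk itself becomes a leaf under its file node.
        chunk_id = f"{file_path}#chunk{index}"
        nodes[chunk_id] = SimpleNode(chunk_id, text, parent_id)
        nodes[parent_id].child_ids.append(chunk_id)
    return nodes


tree = build_path_tree([
    ("docs/a.md", "first part of a"),
    ("docs/a.md", "second part of a"),
    ("docs/sub/b.md", "only part of b"),
])
print(tree["docs/a.md"].child_ids)
# ['docs/a.md#chunk0', 'docs/a.md#chunk1']
```

Unlike merging a fixed number of siblings, the fan-out here follows the folder layout, so a parent can have any number of children.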

logan-markewich · 2023-10-17T17:35:02Z

Maintainer

@azurewtl the current tree index has not been touched/maintained in quite some time 😓 It is in need of quite a lot of refactoring tbh.

If you are ambitious enough to tackle it, it would be appreciated 🙏🏻 tbh I haven't even gone through all the code in there lol

azurewtl · 2023-10-24T09:31:33Z

Author

Generalized from the original topic

The goal here is to build a hierarchical knowledge base structure that helps the retriever find the MOST relevant chunks when given a bunch of documents in folders, whose folder structure itself carries a lot of useful hierarchical meta information.

My current approach would be:

A new FileDirectoryReader that does NOT split a file into multiple docs, and that converts everything into markdown format so that one paragraph-extraction logic works for all files.
A smarter HierarchicalNodeParser that returns hierarchical nodes based on their file path and markdown paragraphs.
A modified ReactAgent that can recursively decide whether to retrieve parent/child nodes based on the previously retrieved nodes.
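As a rough illustration of the parser step, here is a sketch that turns one markdown document into a nested section tree using its headings. It is plain Python, not the llama_index HierarchicalNodeParser; `Section` and `parse_markdown` are hypothetical names, and the sketch deliberately ignores edge cases such as `#` characters inside fenced code blocks:

```python
import re
from dataclasses import dataclass, field
from typing import List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")  # ATX-style markdown headings


@dataclass
class Section:
    """Hypothetical tree node: one heading plus its paragraphs."""
    title: str
    level: int
    paragraphs: List[str] = field(default_factory=list)
    children: List["Section"] = field(default_factory=list)


def parse_markdown(text: str, file_path: str) -> Section:
    """Return a root section (named after the file) with nested headings."""
    root = Section(title=file_path, level=0)
    stack = [root]            # path from the root to the current section
    buffer: List[str] = []    # lines of the paragraph being collected

    def flush() -> None:
        paragraph = " ".join(buffer).strip()
        if paragraph:
            stack[-1].paragraphs.append(paragraph)
        buffer.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Climb up until the top of the stack is a shallower heading.
            while stack[-1].level >= level:
                stack.pop()
            section = Section(title=match.group(2).strip(), level=level)
            stack[-1].children.append(section)
            stack.append(section)
        elif line.strip():
            buffer.append(line.strip())
        else:
            flush()  # a blank line ends the current paragraph
    flush()
    return root


doc = "# Intro\n\nHello\nworld.\n\n## Setup\n\nStep one.\n\n# Usage\n\nRun it.\n"
root = parse_markdown(doc, "docs/guide.md")
print([child.title for child in root.children])  # ['Intro', 'Usage']
```

The root section could then be attached under the file node derived from the folder path, so folder hierarchy and heading hierarchy form one tree.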

Below is my design philosophy:

The hierarchy of knowledge should be stored purely in llama_index.schema.BaseNode.

Recursive browsing should be implemented purely in one place, instead of stacking many indices or engines.
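To make "recursive browsing in one place" more concrete, here is a sketch of a single function that walks parent/child links stored on the nodes themselves. In the proposal the decision would come from an agent/LLM; this sketch swaps in a simple keyword-overlap score instead, and `Node`, `keyword_score`, and `browse` are hypothetical names:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """Hypothetical node that carries its own hierarchy links."""
    node_id: str
    text: str
    parent_id: Optional[str] = None
    child_ids: List[str] = field(default_factory=list)


def keyword_score(query: str, text: str) -> float:
    """Toy relevance: fraction of query words present in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / max(len(query_words), 1)


def browse(
    nodes: Dict[str, Node],
    start_id: str,
    query: str,
    score: Callable[[str, str], float] = keyword_score,
    max_steps: int = 10,
) -> List[str]:
    """Greedily move to the best unvisited parent/child while it improves."""
    visited = [start_id]
    current = nodes[start_id]
    for _ in range(max_steps):
        neighbor_ids = list(current.child_ids)
        if current.parent_id is not None:
            neighbor_ids.append(current.parent_id)
        candidates = [nodes[n] for n in neighbor_ids if n not in visited]
        if not candidates:
            break
        best = max(candidates, key=lambda n: score(query, n.text))
        if score(query, best.text) <= score(query, current.text):
            break  # no neighbor is more relevant than where we are
        current = best
        visited.append(current.node_id)
    return visited


nodes = {
    "root": Node("root", "project documentation", None, ["a", "b"]),
    "a": Node("a", "install guide setup", "root"),
    "b": Node("b", "api reference retriever", "root", ["b1"]),
    "b1": Node("b1", "retriever top k similarity search", "b"),
}
print(browse(nodes, "root", "retriever similarity search"))
# ['root', 'b', 'b1']
```

A greedy walk like this can get stuck when no neighbor scores higher than the current node, which is one reason an agent that reasons over the retrieved nodes may do better than a fixed rule.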

Existing Approaches I Have Researched

While evaluating my plan (and being surprised by how comprehensive the existing features are), I found 4 existing modules that may be able to construct such a hierarchical knowledge base.
I will rule out the approaches one by one, though I am not sure whether my understanding is adequate:

TreeIndex uses GPTTreeIndexBuilder to build a tree from a list of nodes, bottom-up, and stores it as an IndexGraph property. However, the tree structure is simply defined by TreeIndex.num_children. During the retrieval stage, it can select leaves from a specified depth level (the root/all_node modes seem too simple).

Reason not to use:
IndexGraph introduces unnecessary complexity, since the node object already has a relationship property to store the hierarchy.
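The redundancy argument can be illustrated with a small sketch: if every node records its own parent, a separate parent-to-children mapping and the list of roots can be derived on demand. This is plain Python with hypothetical names (`derive_children_map`, `derive_roots`); it does not use or reproduce the actual IndexGraph data structure:

```python
from collections import defaultdict
from typing import Dict, List, Optional


def derive_children_map(parent_of: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Rebuild a parent -> children mapping from per-node parent links."""
    children: Dict[str, List[str]] = defaultdict(list)
    for node_id, parent_id in parent_of.items():
        if parent_id is not None:
            children[parent_id].append(node_id)
    return dict(children)


def derive_roots(parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Nodes without a parent are the roots of the tree."""
    return [node_id for node_id, parent_id in parent_of.items() if parent_id is None]


# Per-node parent links, as they could be stored on the nodes themselves.
parent_of = {"root": None, "a": "root", "b": "root", "a1": "a"}
print(derive_children_map(parent_of))  # {'root': ['a', 'b'], 'a': ['a1']}
print(derive_roots(parent_of))         # ['root']
```

Keeping the links on the nodes means there is only one source of truth to update when the tree changes.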

RouterQueryEngine takes tools/query engines as candidates and uses PydanticMultiSelector to select the query engine that actually answers the question.
ComposableGraph takes children_indices and performs a selection similar to RouterQueryEngine. Instead of prompting the LLM with the _metadatas of the provided candidate query engines, it prompts the LLM with index_summaries. Compared to RouterQueryEngine, ComposableGraph is an index object, which can be instantiated hierarchically. They are still very similar, because neither answers any question directly; they just route.

Reason not to use:
Both approaches require manual construction of the hierarchical index/query engine, and under the hood the index/query engine still needs to find the relevant chunk of information.

NebulaGraphStore is indeed a pure knowledge graph; it translates a natural-language question into a graph traversal query such as Cypher.

Reason not to use:
The build stage of the graph is tedious when given tons of folders and documents. It also takes more effort to work with the original text than simply storing the edges in node.relationship. In most cases, the complexity and performance that a graph database offers might not be needed in a RAG application.
