🤖 Hello @azurewtl, I'm here to help you with your questions about the LlamaIndex repository. Let's dive into your queries.

As for the concern about memory consumption, it's a valid point. However, the design choice to store all `{vector_id: node_id}` mapping information in the index is a trade-off between memory usage and the ability to map vector-store results back to the original data. If memory consumption becomes a problem, you might need to consider other strategies, such as sharding the index across multiple machines, or using a more memory-efficient data structure for storing the index.

I hope this helps! If you have any more questions, feel free to ask.

Best,
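To make the trade-off concrete, here is a minimal sketch (toy classes, not LlamaIndex's actual implementation) of why the index keeps a `{vector_id: node_id}` mapping in memory: a vector store returns only vector ids, so the index needs the mapping to translate hits back into node ids it can fetch from the document store. The cost is one in-process dictionary entry per node.

```python
# Toy model of the {vector_id: node_id} trade-off. All class and method
# names below are illustrative, not LlamaIndex's real API.

class ToyIndexStruct:
    """Stand-in for an index struct: holds the id mapping in process memory."""
    def __init__(self):
        self.nodes_dict = {}  # vector_id -> node_id, one entry per node in RAM


class ToyDocStore:
    """Stand-in for a persistent document store keyed by node id."""
    def __init__(self):
        self._docs = {}

    def add(self, node_id, text):
        self._docs[node_id] = text

    def get(self, node_id):
        return self._docs[node_id]


def retrieve(index_struct, docstore, vector_ids):
    """Translate vector-store result ids into original node content.

    Without nodes_dict, the vector ids returned by a similarity search
    could not be resolved back to the stored nodes.
    """
    return [docstore.get(index_struct.nodes_dict[v]) for v in vector_ids]


struct = ToyIndexStruct()
store = ToyDocStore()
struct.nodes_dict["vec-1"] = "node-1"
store.add("node-1", "hello world")
print(retrieve(struct, store, ["vec-1"]))  # ['hello world']
```

The sketch shows why the mapping grows with the number of nodes, not the number of indexes: instantiating many small indexes over the same storage context multiplies these in-memory dictionaries.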
What is the design intention behind storing all `{vector_id: node_id}` mapping information in an index such as the vector index, and is it necessary?

Shouldn't all the node information be stored in the `index_store`? Otherwise it will eat up Python memory if I instantiate multiple indexes over the same `storage_context`, or with too many small nodes.

In the vector index, I found that `insert_nodes` takes two steps, which really confuses me:

- Step 1: update the `nodes_dict` mapping in `VectorStoreIndex._index_struct`
- Step 2: update the data in the physical database via `_storage_context.index_store`
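The two steps described above can be sketched as follows. This is a simplified toy model under my own naming, not LlamaIndex's actual code: step 1 mutates the in-memory index struct's mapping, and step 2 persists a snapshot of that struct through the index store.

```python
# Toy model of a two-step insert: (1) update the in-memory index struct,
# (2) persist it via the index store. Names are hypothetical.

import uuid


class ToyIndexStore:
    """Stand-in for a persistent index store (the step-2 target)."""
    def __init__(self):
        self.persisted = {}

    def add_index_struct(self, struct_id, nodes_dict):
        # Persist a snapshot of the struct's mapping.
        self.persisted[struct_id] = dict(nodes_dict)


class ToyVectorIndex:
    def __init__(self, index_store):
        self.struct_id = "index-1"
        self.nodes_dict = {}          # in-memory {vector_id: node_id}
        self.index_store = index_store

    def insert_nodes(self, node_ids):
        # Step 1: update the mapping held in the in-memory index struct.
        for node_id in node_ids:
            vector_id = str(uuid.uuid4())
            self.nodes_dict[vector_id] = node_id
        # Step 2: write the updated struct out through the index store.
        self.index_store.add_index_struct(self.struct_id, self.nodes_dict)


store = ToyIndexStore()
index = ToyVectorIndex(store)
index.insert_nodes(["node-a", "node-b"])
print(len(store.persisted["index-1"]))  # 2
```

In this model, step 1 is what keeps the mapping queryable without a round trip to storage, while step 2 is what makes the index recoverable after the process exits; that split is what the question is observing.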