[Question]: How can I use node_parser correctly? #8400
Replies: 7 comments 1 reply
-
🤖 Hello,

The error you're encountering is `AttributeError: 'TextNode' object has no attribute 'get_doc_id'`. As for your question about what `documents` should contain as input, here is the relevant code for `get_nodes_from_documents`:

```python
def get_nodes_from_documents(
    self,
    documents: Sequence[Document],
    show_progress: bool = False,
) -> List[BaseNode]:
    """Parse documents into nodes.

    Args:
        documents (Sequence[Document]): documents to parse
        show_progress (bool): whether to show a progress bar
    """
    with self.callback_manager.event(
        CBEventType.NODE_PARSING, payload={EventPayload.DOCUMENTS: documents}
    ) as event:
        all_nodes: List[BaseNode] = []
        documents_with_progress = get_tqdm_iterable(
            documents, show_progress, "Parsing documents into nodes"
        )
        for document in documents_with_progress:
            nodes = get_nodes_from_document(
                document,
                self.text_splitter,
                self.include_metadata,
                include_prev_next_rel=self.include_prev_next_rel,
            )
            all_nodes.extend(nodes)
        if self.metadata_extractor is not None:
            all_nodes = self.metadata_extractor.process_nodes(all_nodes)
        event.on_end(payload={EventPayload.NODES: all_nodes})
        return all_nodes
```

This method parses each document into nodes based on the text splitter, whether to include metadata, and whether to include previous/next relationships. If a metadata extractor is provided, it is used to process the nodes. The method then returns a list of all nodes. I hope this helps! Let me know if you have any other questions.
-
@xrtechspot Your code looks fine to me. What's the full traceback? Running locally, I don't get any error
-
Traceback (most recent call last):
-
Should `chunks[0]` be a `TextNode` or a `Document` object?

With `PDFReader = download_loader("PDFReader")` alone, `chunks[0]` is:

```
id_='daf7fba7-6943-4c20-bbad-2e493aa4feae' embedding=None metadata={'page_label': '1', 'file_name': '100KB.pdf'} excluded_embed_metadata_keys=[] excluded_llm_metadata_keys=[] relationships={} hash='abee4acbeaefe1989ca7b8a8aa5b546c559a4327f2d1b56092427669ebbd0358' text='Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec iaculis elit nec ante eleifend, \neget cursus tortor auctor. Integer at ultrices lorem, eget bibendum turpis. Praesent lacus \nsapien, ullamcorper gravida suscipit eu , pharetra et eros. Integer interdum dictum volutpat. \nNulla pellentesque ante vitae sapien luctus ornare. Ut a risus quis diam rhoncus rhoncus. \nInteger tortor lorem, vulputate non tempus eu, suscipit id tellus. In laoreet sollicitudin quam, et \ncondimentum justo ultrices eget. Pellentesque hendrerit suscipit diam ac porttitor. Aenean \ndictum id magna vel venenatis. In rhoncus tempus libero, sed consectetur nibh ultrices nec. \nAenean pulvinar euismod ligula at lacinia. Ut consectetur dui ipsum, a rhoncus lacus gravida \nvitae. In mollis tortor in libero lobortis molestie. Integer tempor justo ligula, eu euismod lectus \nfringilla eu. Proin vestibulum sodales tristique. Pellentesque pretium, nibh et aliquet \nscelerisque, felis nulla lobortis tellus, at tristique liber o ipsum a leo. Nulla mauris turpis, feugiat \neu lacus eu, eleifend malesuada lorem. Praesent quis justo ligula. Cras quam risus, ultricies at \nodio accumsan, maximus eleifend justo. Sed sed convallis elit. In finibus congue mauris at \nvenenatis. Praesent pell entesque lacus eros, nec auctor neque semper eget. Sed vehicula \nornare efficitur. Donec magna felis, ullamcorper et euismod eget, facilisis sed nibh. \n \n1 2 3 4 \n5 6 7 8 \n9 10 11 12 \n13 14 15 16 \n \n \n \nFusce efficitur mi ex. Quisque elementum elementum odio, a accumsan nibh consectetur sed. \nInterdum et malesuada fames ac ante ipsum primis in faucibus. Nunc a feugiat urna. Integer sit \namet tincidunt mauris. Donec id urna nec turpis laoreet aliquam ac id ante. Curabitur faucibus \nconsectetur velit, vitae efficitur odio facilisis sed. \n \nAenean pulvinar euismod ligula at lacinia. Ut consectetur dui ipsum, a rhoncus lacus gravida \nvitae. In mollis tortor in libero lobortis molestie. Integer tempor justo ligula, eu euismod lectus \nfringilla eu. Proin vestibulum sodales tr istique. Pellentesque pretium, nibh et aliquet \nscelerisque, felis nulla lobortis tellus, at tristique libero ipsum a leo. Nulla mauris turpis, feugiat \neu lacus eu, eleifend malesuada lorem. Praesent quis justo ligula. Cras quam risus, ultricies at \nodio acc umsan, maximus eleifend justo. Sed sed convallis elit. In finibus congue mauris at \nvenenatis. Praesent pellentesque lacus eros, nec auctor neque semper eget. Sed vehicula \nornare efficitur. Donec magna felis, ullamcorper et euismod eget, facilisis sed nibh. \nVestibulum fermentum augue a lacus mattis condimentum. Sed ligula lacus, faucibus non \nsuscipit sed, volutpat id nisl. Orci varius natoque penatibus et magnis dis parturient montes, ' start_char_idx=None end_char_idx=None text_template='{metadata_str}\n\n{content}' metadata_template='{key}: {value}' metadata_seperator='\n'
```

Type of `chunks[0]`: `<class 'llama_index.schema.Document'>`

With `PDFReader = download_loader("PDFReader")` and `text_splitter = TokenTextSplitter(...)`, `chunks[0]` is:

```
id_='895cc92e-18d3-4994-980d-6636a1fbafbc' embedding=None metadata={'page_label': '1', 'file_name': '100KB.pdf'} excluded_embed_metadata_keys=[] excluded_llm_metadata_keys=[] relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='98cb1b67-16de-493c-a826-92e182e3ddf9', node_type=None, metadata={'page_label': '1', 'file_name': '100KB.pdf'}, hash='abee4acbeaefe1989ca7b8a8aa5b546c559a4327f2d1b56092427669ebbd0358'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='41ecee94-7345-423b-b180-01834bfebbca', node_type=None, metadata={'page_label': '1', 'file_name': '100KB.pdf'}, hash='93d75a746391ff3e44c7a201619e03d0d442e82e2777dbf1a0b5afe76c773826')} hash='1ff9383c381ad547b5de6befba6eca467c0100ef4609e464b92149a248f3f242' text='Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec iaculis elit nec ante eleifend, \neget cursus tortor auctor. Integer at ultrices lorem, eget bibendum turpis. Praesent lacus \nsapien, ullamcorper gravida suscipit eu , pharetra et eros. Integer interdum dictum volutpat. \nNulla pellentesque ante vitae sapien luctus ornare. Ut a risus quis diam rhoncus rhoncus. \nInteger tortor lorem, vulputate non tempus eu, suscipit id tellus. In laoreet sollicitudin quam, et \ncondimentum justo ultrices eget. Pellentesque hendrerit suscipit diam ac porttitor. Aenean \ndictum id magna vel venenatis. In rhoncus tempus libero, sed consectetur nibh ultrices nec. \nAenean pulvinar euismod ligula at lacinia. Ut consectetur dui ipsum, a rhoncus lacus gravida \nvitae. In mollis tortor in libero lobortis molestie. Integer tempor justo ligula, eu euismod lectus \nfringilla eu. Proin vestibulum sodales tristique. Pellentesque pretium, nibh et aliquet \nscelerisque, felis nulla lobortis tellus, at tristique liber o ipsum a leo. Nulla mauris turpis, feugiat \neu lacus eu, eleifend malesuada lorem. Praesent quis justo ligula. Cras quam risus, ultricies at \nodio accumsan, maximus eleifend justo. Sed sed convallis elit. In finibus congue mauris at \nvenenatis. Praesent pell entesque lacus eros, nec auctor neque semper eget. Sed vehicula' start_char_idx=None end_char_idx=None text_template='{metadata_str}\n\n{content}' metadata_template='{key}: {value}' metadata_seperator='\n'
```

Type of `chunks[0]`: `<class 'llama_index.schema.TextNode'>`
-
Getting the same error with your example document. `chunks[0]` is:

```
id_='809bc63f-e6fd-4379-b03b-6cff6417a273' embedding=None metadata={'key': 'val'} excluded_embed_metadata_keys=[] excluded_llm_metadata_keys=[] relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='test', node_type=None, metadata={'key': 'val'}, hash='cf062ecf6b048bf7cf6390375c3ac5145742844c0ffe15483b00aa90ab094e25')} hash='cf062ecf6b048bf7cf6390375c3ac5145742844c0ffe15483b00aa90ab094e25' text='text' start_char_idx=None end_char_idx=None text_template='{metadata_str}\n\n{content}' metadata_template='{key}: {value}' metadata_seperator='\n'
```

Type of `chunks[0]`: `<class 'llama_index.schema.TextNode'>`

```
Traceback (most recent call last):
AttributeError: 'TextNode' object has no attribute 'get_doc_id'. Did you mean: 'ref_doc_id'?
```
-
@logan-markewich I identified the issue: if I use the `VectorStoreIndex` constructor it works,
and the issue appears when using `VectorStoreIndex.from_documents`.
What is the difference when using `node_parser`?
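The difference can be sketched in miniature (a conceptual illustration, not llama_index's actual code): a document-only entry point like `from_documents` calls methods that exist only on `Document` objects, such as `get_doc_id()`, so handing it already-parsed `TextNode` objects produces exactly the `AttributeError` in the traceback above. All class and function names below are invented for this sketch.

```python
class Document:
    """Wrapper around a whole source; documents carry a doc id."""
    def __init__(self, doc_id: str, text: str):
        self.doc_id = doc_id
        self.text = text

    def get_doc_id(self) -> str:
        return self.doc_id

class TextNode:
    """A chunk of a document; has ref_doc_id but no get_doc_id()."""
    def __init__(self, ref_doc_id: str, text: str):
        self.ref_doc_id = ref_doc_id
        self.text = text

def from_documents(documents):
    """Document-only entry point: calls get_doc_id() on each item,
    so it raises AttributeError when handed TextNodes instead."""
    return {d.get_doc_id(): d.text for d in documents}

def from_nodes(nodes):
    """Node entry point: only touches attributes nodes actually have."""
    return {n.ref_doc_id: n.text for n in nodes}

docs = [Document("page_0", "hello")]
nodes = [TextNode("page_0", "hello")]

from_documents(docs)       # fine: documents have get_doc_id()
from_nodes(nodes)          # fine: nodes are used as-is
try:
    from_documents(nodes)  # same failure mode as the traceback above
except AttributeError as e:
    print(e)
```

This is why `VectorStoreIndex(nodes)` works after running `node_parser` yourself, while `VectorStoreIndex.from_documents` expects raw `Document` objects and does the node parsing internally.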
-
@xrtechspot I also got kind of stuck here; weaviate's LlamaIndex example helped me. I had also missed the distinction: Documents are wrappers around data, while nodes are chunks of documents. Also, nodes are first-class citizens in LlamaIndex. You can learn more about nodes from here.
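To make the wrapper-vs-chunk distinction concrete, here is a minimal stdlib-only sketch (not llama_index code) of how one document's text becomes several overlapping node chunks, in the spirit of `TokenTextSplitter`'s `chunk_size`/`chunk_overlap` parameters (here applied to characters rather than tokens, for simplicity):

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Slide a window of chunk_size over the text, stepping by
    chunk_size - chunk_overlap so consecutive chunks share context."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

document_text = "0123456789" * 3  # one "document" of 30 characters
chunks = split_with_overlap(document_text, chunk_size=10, chunk_overlap=4)
# Each chunk would become one node that refers back to the source document,
# and the last 4 characters of each chunk repeat at the start of the next.
```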
-
Question Validation
Question
Hello,
I'm new to LlamaIndex and am trying to use `node_parser` to control the chunk size of a PDF document.
When I run this script I get the error below:
AttributeError: 'TextNode' object has no attribute 'get_doc_id'. Did you mean: 'ref_doc_id'?
What am I doing wrong? What should `documents` contain as input?
Thank you
```python
from pathlib import Path

import PyPDF2
from llama_index import Document
from llama_index.node_parser import SimpleNodeParser
from llama_index.text_splitter import TokenTextSplitter

documents = []
file_name = Path(file_path).name
with open(file_path, "rb") as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    for i, page in enumerate(pdf_reader.pages):
        text = page.extract_text()
        if text:
            # Adding metadata to each Document object
            metadata = {
                'page_label': str(i + 1),
                'file_name': file_name
            }
            doc = Document(text=text, doc_id=f"page_{i}", metadata=metadata)
            documents.append(doc)

text_splitter = TokenTextSplitter(
    separator="\n",
    chunk_size=512,
    chunk_overlap=128
)
node_parser = SimpleNodeParser.from_defaults(text_splitter=text_splitter)
nodes = node_parser.get_nodes_from_documents(documents)
```