Replies: 1 comment
🤖 Based on the context provided, there are a few ways you can speed up the metadata extraction process in LlamaIndex. One option is to parallelize the per-document node parsing (which is what triggers the LLM call for each extractor) with a `ThreadPoolExecutor`, for example:
```python
from concurrent.futures import ThreadPoolExecutor
from llama_index import (ServiceContext, SimpleDirectoryReader,
                         VectorStoreIndex, KnowledgeGraphIndex)
from llama_index.llms import OpenAI
from llama_index.text_splitter import TokenTextSplitter
from llama_index.node_parser import SimpleNodeParser
from llama_index.node_parser.extractors import (MetadataExtractor, TitleExtractor,
                                                KeywordExtractor)

def extract_metadata():
    llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
    documents = SimpleDirectoryReader("Data").load_data()
    text_splitter = TokenTextSplitter(separator=" ", chunk_size=1024, chunk_overlap=128)
    metadata_extractor = MetadataExtractor(
        extractors=[
            TitleExtractor(nodes=5, llm=llm),
            KeywordExtractor(keywords=2, llm=llm),
        ]
    )
    node_parser = SimpleNodeParser(
        text_splitter=text_splitter, metadata_extractor=metadata_extractor
    )
    service_context = ServiceContext.from_defaults(llm=llm)
    # Parse each document (and run its metadata extractors) in its own thread,
    # then flatten the per-document node lists into a single list.
    with ThreadPoolExecutor() as executor:
        per_doc_nodes = executor.map(
            lambda doc: node_parser.get_nodes_from_documents([doc]), documents
        )
        nodes = [node for doc_nodes in per_doc_nodes for node in doc_nodes]
    vector_index = VectorStoreIndex(nodes, service_context=service_context, show_progress=True)
    kg_index = KnowledgeGraphIndex(
        nodes, max_triplets_per_chunk=4, service_context=service_context, show_progress=True
    )
    return kg_index, vector_index
```
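If it helps, here is a minimal usage sketch for the function above (the query string is just a placeholder):

```python
# Build both indices once, then query the vector index as usual.
kg_index, vector_index = extract_metadata()
query_engine = vector_index.as_query_engine()
print(query_engine.query("What are these documents about?"))
```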
Please note that these are general suggestions and might require adjustments to fit your specific needs. If you need further assistance, feel free to ask.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
Hi, I'm extracting metadata from nodes using MetadataExtractor() with calls to OpenAI, but it's painfully slow on the 1,800 nodes I'm testing with, especially given the occasional 10-minute wait for a timeout.
From a search of the documentation and a look at the code for MetadataExtractor, OpenAI, etc., I can't see how to batch these requests, nor how to set a shorter timeout. Is there something I'm missing, or should I raise a feature request for this?
Basically, what I'd love to know how to do, or to see implemented, is get the same kind of speed-up that embedding generation achieves with batching.
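For context, this is the kind of batching I mean on the embedding side; `embed_batch_size` on `OpenAIEmbedding` is the existing knob, and I haven't found an equivalent for the metadata extractors (the batch size of 50 below is just an example):

```python
from llama_index import ServiceContext
from llama_index.embeddings import OpenAIEmbedding

# Embeddings are already batched: one API call per batch of 50 chunks here.
embed_model = OpenAIEmbedding(embed_batch_size=50)
service_context = ServiceContext.from_defaults(embed_model=embed_model)
```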