Replies: 1 comment
🤖 Based on the context provided, there are a few ways you can speed up the metadata extraction process in LlamaIndex. One option is to parallelize the per-document node parsing (which is what triggers the LLM call for each extractor) with a `ThreadPoolExecutor`, for example:
```python
from concurrent.futures import ThreadPoolExecutor
from llama_index import (ServiceContext, SimpleDirectoryReader,
                         VectorStoreIndex, KnowledgeGraphIndex)
from llama_index.llms import OpenAI
from llama_index.text_splitter import TokenTextSplitter
from llama_index.node_parser import SimpleNodeParser
from llama_index.node_parser.extractors import (MetadataExtractor, TitleExtractor,
                                                KeywordExtractor)

def extract_metadata():
    llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
    documents = SimpleDirectoryReader("Data").load_data()
    text_splitter = TokenTextSplitter(separator=" ", chunk_size=1024, chunk_overlap=128)
    metadata_extractor = MetadataExtractor(
        extractors=[
            TitleExtractor(nodes=5, llm=llm),
            KeywordExtractor(keywords=2, llm=llm),
        ]
    )
    node_parser = SimpleNodeParser(
        text_splitter=text_splitter, metadata_extractor=metadata_extractor
    )
    service_context = ServiceContext.from_defaults(llm=llm)
    # Parse each document (and run its metadata extractors) in its own thread,
    # then flatten the per-document node lists into a single list.
    with ThreadPoolExecutor() as executor:
        per_doc_nodes = executor.map(
            lambda doc: node_parser.get_nodes_from_documents([doc]), documents
        )
        nodes = [node for doc_nodes in per_doc_nodes for node in doc_nodes]
    vector_index = VectorStoreIndex(nodes, service_context=service_context, show_progress=True)
    kg_index = KnowledgeGraphIndex(
        nodes, max_triplets_per_chunk=4, service_context=service_context, show_progress=True
    )
    return kg_index, vector_index
```
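If it helps, here is a minimal usage sketch for the function above (the query string is just a placeholder):

```python
# Build both indices once, then query the vector index as usual.
kg_index, vector_index = extract_metadata()
query_engine = vector_index.as_query_engine()
print(query_engine.query("What are these documents about?"))
```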
Please note that these are general suggestions and might require adjustments to fit your specific needs. If you need further assistance, feel free to ask.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
Hi, I'm extracting metadata from nodes using MetadataExtractor() with calls to OpenAI, but it's painfully slow on the 1,800 nodes I'm testing with, especially given the occasional 10-minute wait for a timeout.
From a search of the documentation and a look at the code for MetadataExtractor, OpenAI, etc., I can't see how to batch these requests, nor how to set a shorter timeout. Is there something I'm missing, or should I raise a feature request for this?
Basically, what I'd love to know how to do, or to see implemented, is get the same kind of speed-up that embedding generation achieves with batching.
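For context, this is the kind of batching I mean on the embedding side; `embed_batch_size` on `OpenAIEmbedding` is the existing knob, and I haven't found an equivalent for the metadata extractors (the batch size of 50 below is just an example):

```python
from llama_index import ServiceContext
from llama_index.embeddings import OpenAIEmbedding

# Embeddings are already batched: one API call per batch of 50 chunks here.
embed_model = OpenAIEmbedding(embed_batch_size=50)
service_context = ServiceContext.from_defaults(embed_model=embed_model)
```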