
Speed to process 11MB of text into vector database #321

Open
GaryDean opened this issue Nov 22, 2024 · 5 comments

@GaryDean

I am creating a vector database using this hardware:

Hardware:
    LENOVO_MT_82WK_BU_idea_FM_Legion Pro 5 16IRX8 32GB
    NVIDIA GeForce RTX 4070 8GB

Text data is as follows:

Data Files:
   Total # text files: 3,482 files
       Total filesize: 11,626,546 bytes
     Average filesize: 3,339 bytes
      Median filesize: 2,908 bytes
    Smallest filesize: 36 bytes
     Largest filesize: 245,026 bytes

I am using the default models to process this data.

Processing this data into the vector database took ~32 hours.

Am I "holding it wrong"?
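
For reference, a minimal timing harness along these lines can show whether the time goes into embedding or into the LLM calls (a sketch; `rag.insert` is a placeholder for whatever insert/ingest call the library actually exposes, not its real API):

```python
import time
from pathlib import Path

def ingest_with_timing(rag, data_dir: str) -> None:
    """Feed every .txt file to the pipeline and report per-file and total timings."""
    files = sorted(Path(data_dir).glob("*.txt"))
    total_bytes = 0
    start = time.perf_counter()
    for f in files:
        text = f.read_text(encoding="utf-8", errors="ignore")
        t0 = time.perf_counter()
        rag.insert(text)  # placeholder: swap in the library's actual insert call
        dt = time.perf_counter() - t0
        total_bytes += len(text.encode("utf-8"))
        print(f"{f.name}: {len(text):,} chars in {dt:.1f}s")
    elapsed = time.perf_counter() - start
    print(f"{len(files)} files, {total_bytes / 1e6:.1f} MB in {elapsed / 3600:.2f} h "
          f"({total_bytes / elapsed:.0f} bytes/s)")
```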

@fatehss commented Nov 24, 2024

It took me many hours to process a CSV of around 2 MB... we are also experiencing this issue.

@LarFii (Collaborator) commented Nov 27, 2024

It’s common for local models to process data slowly, especially when using smaller models.

@GaryDean (Author)

The statistics above were produced using the default models, gpt-4o-mini and text-embedding-3-small.

@jingerhuang

I am running into the same slowness. I am using a locally implemented text-embedding function together with a remote LLM API (deepseek). Judging from hardware utilization, the bottleneck is the LLM processing stage, not the vectorization stage. According to the remote API's statistics, each request sends about 3k tokens and returns about 1.2k tokens, taking 60-80 seconds, which is in line with normal LLM throughput. The real question is why so many tokens need to be processed at all: a single 357 KB text document of 123,579 characters generated (3 + 1.2)k tokens × 89 requests of traffic. That does not seem reasonable to me; something may be hurting performance.
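
A quick back-of-envelope on those figures (a sketch; the request count, per-request token counts, and 60-80 s latency are taken from the observation above, and strictly sequential calls are assumed):

```python
# Back-of-envelope from the figures above (assumes strictly sequential LLM calls).
doc_chars = 123_579                 # the single 357 KB document
calls = 89                          # observed number of LLM requests for it
tok_in, tok_out = 3_000, 1_200      # per-request token counts from the API stats
sec_per_call = 70                   # midpoint of the reported 60-80 s

total_tokens = calls * (tok_in + tok_out)    # 373,800 tokens
doc_hours = calls * sec_per_call / 3600      # ~1.7 hours for this one document
chars_per_call = doc_chars / calls           # ~1,389 source chars per request

print(f"{total_tokens:,} tokens, {doc_hours:.1f} h, ~{chars_per_call:.0f} chars/request")
```

At roughly 1,400 source characters per request, the request count is set by the chunk size, and nearly all of the wall-clock time is the 60-80 s per LLM round trip.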

@jingerhuang

I monitored the tokens being uploaded: for every 1,500 characters of content (mostly Chinese characters in my test), a prompt of 10,758 characters (about 1.2k words) is sent along with it. The returned structure is perfect. I think I may have found the cause of the slowness; in a sense, this speed might actually be reasonable ╮(╯_╰)╭
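
A rough per-call token accounting based on that observation (a sketch; the characters-per-token ratios are assumptions, not measurements):

```python
# Rough per-call accounting (chars-per-token ratios are assumptions, not measurements).
chunk_chars = 1_500           # content sent per LLM call (mostly Chinese in this test)
prompt_chars = 10_758         # fixed prompt template sent along with every chunk

prompt_tokens = prompt_chars / 4     # assume ~4 chars/token for the English prompt
content_tokens = chunk_chars / 1.5   # assume ~1.5 chars/token for Chinese text

total_in = prompt_tokens + content_tokens    # ≈ 3,700 input tokens per call
prompt_share = prompt_tokens / total_in      # ≈ 73% of the input is the fixed prompt
print(f"≈{total_in:.0f} input tokens/call, {prompt_share:.0%} of which is the fixed prompt")
```

That is roughly consistent with the ~3k input tokens per request reported above: most of every request is the fixed prompt, so the per-chunk overhead, rather than the document text itself, sets the total token bill.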
