
Speed to process 11MB of text into vector database #321

Open
GaryDean opened this issue Nov 22, 2024 · 5 comments

@GaryDean

I am creating a vector database using this hardware:

Hardware:
    LENOVO_MT_82WK_BU_idea_FM_Legion Pro 5 16IRX8 32GB
    NVIDIA GeForce RTX 4070 8GB

Text data is as follows:

Data Files:
   Total # text files: 3,482 files
       Total filesize: 11,626,546 bytes
     Average filesize: 3,339 bytes
      Median filesize: 2,908 bytes
    Smallest filesize: 36 bytes
     Largest filesize: 245,026 bytes

I am using the default models to process this data.

Processing this data into the vector database took ~32 hours.

Am I "holding it wrong"?
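
For reference, a minimal timing harness along these lines can show whether the time goes into embedding or into the LLM calls (a sketch; `rag.insert` is a placeholder for whatever insert/ingest call the library actually exposes, not its real API):

```python
import time
from pathlib import Path

def ingest_with_timing(rag, data_dir: str) -> None:
    """Feed every .txt file to the pipeline and report per-file and total timings."""
    files = sorted(Path(data_dir).glob("*.txt"))
    total_bytes = 0
    start = time.perf_counter()
    for f in files:
        text = f.read_text(encoding="utf-8", errors="ignore")
        t0 = time.perf_counter()
        rag.insert(text)  # placeholder: swap in the library's actual insert call
        dt = time.perf_counter() - t0
        total_bytes += len(text.encode("utf-8"))
        print(f"{f.name}: {len(text):,} chars in {dt:.1f}s")
    elapsed = time.perf_counter() - start
    print(f"{len(files)} files, {total_bytes / 1e6:.1f} MB in {elapsed / 3600:.2f} h "
          f"({total_bytes / elapsed:.0f} bytes/s)")
```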

@fatehss commented Nov 24, 2024

It took me many hours to process a CSV of around 2 MB... we are also experiencing this issue.

@LarFii (Collaborator) commented Nov 27, 2024

It’s common for local models to process data slowly, especially when using smaller models.

@GaryDean (Author)

The statistics above were produced using the default models, gpt-4o-mini and text-embedding-3-small.

@jingerhuang

I am running into the same slowness. I am using a locally implemented text-embedding function together with a remote LLM API (deepseek). Judging from hardware utilization, the bottleneck is the LLM processing stage, not the vectorization stage. According to the remote API's statistics, each request sends about 3k tokens and returns about 1.2k tokens, taking 60-80 seconds, which is in line with normal LLM throughput. The real question is why so many tokens need to be processed at all: a single 357 KB text document of 123,579 characters generated (3 + 1.2)k tokens × 89 requests of traffic. That does not seem reasonable to me; something may be hurting performance.
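
A quick back-of-envelope on those figures (a sketch; the request count, per-request token counts, and 60-80 s latency are taken from the observation above, and strictly sequential calls are assumed):

```python
# Back-of-envelope from the figures above (assumes strictly sequential LLM calls).
doc_chars = 123_579                 # the single 357 KB document
calls = 89                          # observed number of LLM requests for it
tok_in, tok_out = 3_000, 1_200      # per-request token counts from the API stats
sec_per_call = 70                   # midpoint of the reported 60-80 s

total_tokens = calls * (tok_in + tok_out)    # 373,800 tokens
doc_hours = calls * sec_per_call / 3600      # ~1.7 hours for this one document
chars_per_call = doc_chars / calls           # ~1,389 source chars per request

print(f"{total_tokens:,} tokens, {doc_hours:.1f} h, ~{chars_per_call:.0f} chars/request")
```

At roughly 1,400 source characters per request, the request count is set by the chunk size, and nearly all of the wall-clock time is the 60-80 s per LLM round trip.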

@jingerhuang

I monitored the tokens being uploaded: for every 1,500 characters of content (mostly Chinese characters in my test), a prompt of 10,758 characters (about 1.2k words) is sent along with it. The returned structure is perfect. I think I may have found the cause of the slowness; in a sense, this speed might actually be reasonable ╮(╯_╰)╭
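
A rough per-call token accounting based on that observation (a sketch; the characters-per-token ratios are assumptions, not measurements):

```python
# Rough per-call accounting (chars-per-token ratios are assumptions, not measurements).
chunk_chars = 1_500           # content sent per LLM call (mostly Chinese in this test)
prompt_chars = 10_758         # fixed prompt template sent along with every chunk

prompt_tokens = prompt_chars / 4     # assume ~4 chars/token for the English prompt
content_tokens = chunk_chars / 1.5   # assume ~1.5 chars/token for Chinese text

total_in = prompt_tokens + content_tokens    # ≈ 3,700 input tokens per call
prompt_share = prompt_tokens / total_in      # ≈ 73% of the input is the fixed prompt
print(f"≈{total_in:.0f} input tokens/call, {prompt_share:.0%} of which is the fixed prompt")
```

That is roughly consistent with the ~3k input tokens per request reported above: most of every request is the fixed prompt, so the per-chunk overhead, rather than the document text itself, sets the total token bill.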
