Why don't sparse vectors support negative values? #38038
-
Sparse vector values can definitely be negative; I think the check here is that the index of each non-zero position cannot be negative.
Could I suggest that the docs be updated to indicate this to users? There is no mention of it on https://milvus.io/docs/sparse_vector.md
-
There is also a test documenting that negative values are not allowed: milvus/pkg/util/typeutil/schema_test.go, Line 2346 in 302650a. I guess it's a bug? To replicate:
I just used the sample data generated by Attu, removed all rows except the first one, and then made one of the values negative.
Expected: the data would be added. Instead, it can't be added because of this issue.
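A rough pymilvus sketch of the same reproduction (the collection and field names here are made up for illustration, and this assumes a local Milvus instance):

```python
from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="http://localhost:19530")

# Schema with a primary key and a sparse float vector field.
schema = MilvusClient.create_schema(auto_id=False)
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="vec", datatype=DataType.SPARSE_FLOAT_VECTOR)
client.create_collection(collection_name="neg_sparse_demo", schema=schema)

# A single row whose sparse vector contains a negative value: instead of
# being added, the insert is rejected by the validation discussed below.
client.insert(
    collection_name="neg_sparse_demo",
    data=[{"id": 1, "vec": {3: 0.42, 17: -0.1}}],
)
```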
-
In the source code of Milvus, a negative value will trigger an error here: milvus/pkg/util/typeutil/schema.go, Line 1705 in 302650a. The check logic was added by this PR. I think it currently doesn't accept negative values by design. Maybe we should remove this negative-value check logic. @zhengbuqian
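For reference, the check roughly amounts to the following; this is only a Python approximation for illustration, the actual implementation is the Go code in schema.go linked above:

```python
def validate_sparse_float_vector(pairs: dict[int, float]) -> None:
    # Rough illustration of the validation applied to each sparse row.
    for index, value in pairs.items():
        if index < 0:
            raise ValueError(f"invalid index {index}: indices must be non-negative")
        if value < 0:
            # This is the branch the discussion is about: negative *values*
            # are rejected as well, not only negative indices.
            raise ValueError(f"invalid value {value}: negative values are not allowed")
```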
-
Thanks for the Q! Yes, negative values are currently intentionally disallowed in Milvus sparse float vectors. The main reason is that, with IP being the only supported metric for sparse float vectors, two sparse float vectors with no common dimension have an IP score of 0, while if negative values were allowed, two sparse float vectors with common dimensions could have a negative IP score, and we can't determine whether they should be defined as relevant or irrelevant. Example:

query = {1: 1.0}
A = {1: 0.1, 2: 0.2}  # IP score: 0.1
B = {1: 0.9}          # IP score: 0.9
C = {3: 0.1}          # IP score: 0
# ... and many other vectors whose value is 0 at dim 1

When searching for the top 3 neighbors of the query, we return B, A and C. If we allowed negative values in sparse float vectors, say we had another vector:

D = {1: -1.0}  # IP score: -1.0

When searching for the top 3 neighbors of the query, should we return B, A and C, or B, A and D? C shares no dimension with the query at all, while D shares a dimension but has a lower score. So we decided not to support negative sparse vector values before seeing an actual use case.

With that being said, I'd love to see how a negative sparse vector value can be utilized. You mentioned "The dimensions are 96x<300": do you mean the ColBERT model outputs 96 embeddings for each doc, and each embedding is sparse and has fewer than 300 dims? Does the 300 mean the number of dims with non-zero values, or the global max dim? How is the sparse vector score computed / how is the distance defined when negative values occur?
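To make the numbers in the example concrete, here is a small sketch of how those IP scores are computed (plain Python, not a Milvus API):

```python
def sparse_ip(a: dict[int, float], b: dict[int, float]) -> float:
    # Inner product of two sparse vectors given as {dim: value} dicts.
    return sum(value * b.get(dim, 0.0) for dim, value in a.items())

query = {1: 1.0}
A = {1: 0.1, 2: 0.2}
B = {1: 0.9}
C = {3: 0.1}
D = {1: -1.0}

for name, vec in [("A", A), ("B", B), ("C", C), ("D", D)]:
    print(name, sparse_ip(query, vec))  # A 0.1, B 0.9, C 0.0, D -1.0
```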
-
@zhengbuqian Thank you so much for such a clear explanation :) I suspected it was likely due to indexing. So, as you say, sparse vectors cannot be used to store multi-vectors because of this limitation. Neither can arrays or JSON, due to capacity limits. However, we can keep the output vectors in Postgres for now; it's just a more complex setup. I suggest the docs be updated: https://milvus.io/docs/sparse_vector.md In Qdrant it's quite simple to do:
From https://qdrant.tech/articles/late-interaction-models/ But I guess it's more of a feature request :) Thank you for the quick replies, I really like using Milvus.
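For context, a minimal sketch of the kind of multivector (late-interaction) setup the linked Qdrant article describes; the API names below are from recent qdrant-client releases and are my assumption of what the article uses, so please verify against it:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Each point stores a list of per-token vectors; MAX_SIM scores a query's
# token vectors against them ColBERT-style.
client.create_collection(
    collection_name="colbert_demo",
    vectors_config=models.VectorParams(
        size=96,  # per-token embedding size, e.g. answerai-colbert-small-v1
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM
        ),
    ),
)
```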
-
Thanks to the Milvus team and community for such a wonderful database!
There is something I have been trying to understand, and I can't find the answer in the docs or the discussions.
We use the output token embeddings of models like answerai-colbert-small-v1, which contain negative values. The dimensions are 96 x <300: floats, possibly negative. I see these cannot be stored as a sparse vector (#32250).
Can I do this in Milvus, or should I use a different vector db for this purpose so I can do re-ranking? Thanks, everyone :)
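A minimal sketch of the ColBERT-style MaxSim re-ranking in question, assuming the per-token embeddings (e.g. 96-dim vectors from answerai-colbert-small-v1) are already computed and fetched from wherever they are stored; shapes and names are illustrative, not a Milvus API:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    # query_emb: (num_query_tokens, dim), doc_emb: (num_doc_tokens, dim).
    # Compare every query token against every doc token, keep each query
    # token's best-matching doc token, and sum those maxima.
    sim = query_emb @ doc_emb.T
    return float(sim.max(axis=1).sum())

# Toy example: 4 query tokens, two candidate docs, embedding dim 96.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 96))
docs = {"doc_a": rng.normal(size=(120, 96)), "doc_b": rng.normal(size=(80, 96))}
reranked = sorted(docs, key=lambda name: maxsim_score(q, docs[name]), reverse=True)
print(reranked)
```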