Why don't sparse vectors support negative values? #38038
-
Sparse vector values can definitely be negative; I think the check here is that the index of each non-zero position cannot be negative.
Could I suggest that the docs be updated to indicate this to users? There is no mention of it on https://milvus.io/docs/sparse_vector.md
-
There is also a test documenting that negative values are not allowed: milvus/pkg/util/typeutil/schema_test.go, Line 2346 in 302650a. I guess it's a bug? To replicate:
I just used the sample data generated by Attu, removed all rows except the first one, and then made one of the values negative.
Expected: the data would be added. Instead, it can't be added because of this issue.
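A rough pymilvus sketch of the same reproduction (the collection and field names here are made up for illustration, and this assumes a local Milvus instance):

```python
from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="http://localhost:19530")

# Schema with a primary key and a sparse float vector field.
schema = MilvusClient.create_schema(auto_id=False)
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="vec", datatype=DataType.SPARSE_FLOAT_VECTOR)
client.create_collection(collection_name="neg_sparse_demo", schema=schema)

# A single row whose sparse vector contains a negative value: instead of
# being added, the insert is rejected by the validation discussed below.
client.insert(
    collection_name="neg_sparse_demo",
    data=[{"id": 1, "vec": {3: 0.42, 17: -0.1}}],
)
```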
-
In the source code of Milvus, a negative value will trigger an error here: milvus/pkg/util/typeutil/schema.go, Line 1705 in 302650a. The check logic was added by this PR. I think it currently doesn't accept negative values by design. Maybe we should remove this negative-value check logic. @zhengbuqian
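For reference, the check roughly amounts to the following; this is only a Python approximation for illustration, the actual implementation is the Go code in schema.go linked above:

```python
def validate_sparse_float_vector(pairs: dict[int, float]) -> None:
    # Rough illustration of the validation applied to each sparse row.
    for index, value in pairs.items():
        if index < 0:
            raise ValueError(f"invalid index {index}: indices must be non-negative")
        if value < 0:
            # This is the branch the discussion is about: negative *values*
            # are rejected as well, not only negative indices.
            raise ValueError(f"invalid value {value}: negative values are not allowed")
```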
-
Thanks for the Q! Yes, negative values are currently intentionally disallowed in Milvus sparse float vectors. The main reason is that, with IP being the only supported metric for sparse float vectors, two sparse float vectors with no common dimension have an IP score of 0, while if negative values were allowed, two sparse float vectors with common dimensions could have a negative IP score, and we can't determine whether they should be defined as relevant or irrelevant. Example:

query = {1: 1.0}
A = {1: 0.1, 2: 0.2}  # IP score: 0.1
B = {1: 0.9}          # IP score: 0.9
C = {3: 0.1}          # IP score: 0
# ... and many other vectors whose value is 0 at dim 1

When searching for the top 3 neighbors of the query, we return B, A and C. If we allowed negative values in sparse float vectors, say we had another vector:

D = {1: -1.0}  # IP score: -1.0

When searching for the top 3 neighbors of the query, should we return B, A and C, or B, A and D? C shares no dimension with the query at all, while D shares a dimension but has a lower score. So we decided not to support negative sparse vector values before seeing an actual use case.

With that being said, I'd love to see how a negative sparse vector value can be utilized. You mentioned "The dimensions are 96x<300": do you mean the ColBERT model outputs 96 embeddings for each doc, and each embedding is sparse and has fewer than 300 dims? Does the 300 mean the number of dims with non-zero values, or the global max dim? How is the sparse vector score computed / how is the distance defined when negative values occur?
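To make the numbers in the example concrete, here is a small sketch of how those IP scores are computed (plain Python, not a Milvus API):

```python
def sparse_ip(a: dict[int, float], b: dict[int, float]) -> float:
    # Inner product of two sparse vectors given as {dim: value} dicts.
    return sum(value * b.get(dim, 0.0) for dim, value in a.items())

query = {1: 1.0}
A = {1: 0.1, 2: 0.2}
B = {1: 0.9}
C = {3: 0.1}
D = {1: -1.0}

for name, vec in [("A", A), ("B", B), ("C", C), ("D", D)]:
    print(name, sparse_ip(query, vec))  # A 0.1, B 0.9, C 0.0, D -1.0
```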
-
@zhengbuqian Thank you so much for such a clear explanation :) I suspected it was likely due to indexing. So, as you say, sparse vectors cannot be used to store multi-vectors because of this limitation. Neither can arrays or JSON, due to capacity limits. However, we can keep the output vectors in Postgres for now; it's just a more complex setup. I suggest the docs be updated: https://milvus.io/docs/sparse_vector.md In Qdrant it's quite simple to do:
From https://qdrant.tech/articles/late-interaction-models/ But I guess it's more of a feature request :) Thank you for the quick replies, I really like using Milvus.
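For context, a minimal sketch of the kind of multivector (late-interaction) setup the linked Qdrant article describes; the API names below are from recent qdrant-client releases and are my assumption of what the article uses, so please verify against it:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Each point stores a list of per-token vectors; MAX_SIM scores a query's
# token vectors against them ColBERT-style.
client.create_collection(
    collection_name="colbert_demo",
    vectors_config=models.VectorParams(
        size=96,  # per-token embedding size, e.g. answerai-colbert-small-v1
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM
        ),
    ),
)
```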
-
Thanks to the Milvus team and community for such a wonderful database!
There is something I have been trying to understand, and I can't find the answer in the docs or the discussions.
We use the output token embeddings of models like answerai-colbert-small-v1, which contain negative values. The dimensions are 96 x <300: floats, possibly negative. I see these cannot be stored as a sparse vector (#32250).
Can I do this in Milvus, or should I use a different vector db for this purpose so I can do re-ranking? Thanks, everyone :)
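A minimal sketch of the ColBERT-style MaxSim re-ranking in question, assuming the per-token embeddings (e.g. 96-dim vectors from answerai-colbert-small-v1) are already computed and fetched from wherever they are stored; shapes and names are illustrative, not a Milvus API:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    # query_emb: (num_query_tokens, dim), doc_emb: (num_doc_tokens, dim).
    # Compare every query token against every doc token, keep each query
    # token's best-matching doc token, and sum those maxima.
    sim = query_emb @ doc_emb.T
    return float(sim.max(axis=1).sum())

# Toy example: 4 query tokens, two candidate docs, embedding dim 96.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 96))
docs = {"doc_a": rng.normal(size=(120, 96)), "doc_b": rng.normal(size=(80, 96))}
reranked = sorted(docs, key=lambda name: maxsim_score(q, docs[name]), reverse=True)
print(reranked)
```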