invalid input for sparse float vector #35

yidasanqian · 2024-09-03T10:27:27Z

code：

analyzer = build_default_analyzer(language="zh")
bm25_ef = BM25EmbeddingFunction(analyzer)
bm25_ef.load("D:/Downloads/bm25_msmarco_v1.json")

def test():
  entities = [....]
  for entity in entities:    
    docs_embeddings = bm25_ef.encode_documents([entity["content"]])       
    # Convert csr_array to the format Milvus expects (List of Dictionaries)
    sparse_vector = {int(idx): float(val) for idx, val in zip(docs_embeddings[0].indices, docs_embeddings[0].data)}
    entity["content_sparse"] = sparse_vector
  res = client.upsert(collection_name=INDEX_NAME, data=entities)       
  return res["ids"]

trace back output：

  File "d:\Develop\conda\envs\mkb\lib\concurrent\futures\_base.py", line 451, in result
    return self.__get_result()
  File "d:\Develop\conda\envs\mkb\lib\concurrent\futures\_base.py", line 403, in __get_result
    raise self._exception
  File "d:\Develop\conda\envs\mkb\lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "D:\Develop\CodeProjects\mkb\src\tests\mkb\storage\build_1k_data_for_milvus.py", line 52, in append_rag_eval_entry
    res = client.upsert(collection_name=INDEX_NAME, data=entities)
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\milvus_client\milvus_client.py", line 276, in upsert
    raise ex from ex
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\milvus_client\milvus_client.py", line 272, in upsert
    res = conn.upsert_rows(
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\decorators.py", line 148, in handler
    raise e from e
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\decorators.py", line 144, in handler
    return func(*args, **kwargs)
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\decorators.py", line 183, in handler
    return func(self, *args, **kwargs)
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\decorators.py", line 123, in handler
    raise e from e
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\decorators.py", line 87, in handler
    return func(*args, **kwargs)
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\client\grpc_handler.py", line 715, in upsert_rows
    request = self._prepare_row_upsert_request(
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\client\grpc_handler.py", line 696, in _prepare_row_upsert_request
    return Prepare.row_upsert_param(
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\client\prepare.py", line 461, in row_upsert_param
    return cls._parse_row_request(request, fields_info, enable_dynamic, entities)
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\client\prepare.py", line 389, in _parse_row_request
    entity_helper.pack_field_value_to_field_data(v, field_data, field_info)
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\client\entity_helper.py", line 361, in pack_field_value_to_field_data
    raise ParamError(message="invalid input for sparse float vector")
pymilvus.exceptions.ParamError: <ParamError: (code=1, message=invalid input for sparse float vector)>

What's the reason? How to solve it?

The text was updated successfully, but these errors were encountered:

yidasanqian · 2024-09-04T03:19:12Z

entity["content"]="""
无机预涂板是一种环保板材。无机预涂板通常采用防火、抗菌、耐腐蚀和易清洁等，能够有效提高建筑物的装修质量和性能。\n以下是无机预涂板的环保特点：\n无机材料：无机预涂板基板采用无石棉硅酸钙板，不含有害的有机物，不会释放有害气体，不会对室内空气质量造成污染。\n绿色环保：无机预涂板符合绿色环保要求，不含有害物质，是一种绿色环保的装饰材料。\n耐久性：无机预涂板具有良好的耐久性，不易腐烂、老化、脆化和变形，使用寿命长，不会频繁更换，减少资源浪费。\n总之，无机预涂板是一种环保板材，符合绿色环保要求，对室内空气质量和人体健康无害，同时具有不错的装饰效果和耐久性。
"""

wxywb · 2024-09-04T12:47:38Z

"bm25_msmarco_v1.json" is only for English corpus, you need to fit parameters on your own documents. Here is code example

from pymilvus.model.sparse.bm25.tokenizers import build_default_analyzer
from pymilvus.model.sparse import BM25EmbeddingFunction
from pymilvus import MilvusClient,  DataType

analyzer = build_default_analyzer(language="zh")

docs = [
    "无机预涂板是一种具有优良性能的环保材料，常被应用于防火、抗菌、耐化学腐蚀等领域。",
    "无机预涂板以其卓越的耐火性、抗菌性和易维护性，被广泛应用于各类建筑场景。",
    "无机预涂板拥有防火、耐腐蚀、易清洁等特点，成为现代建筑中环保材料的首选。",
    "无机预涂板兼具环保和实用性，具有防火、抗菌、耐酸碱等多种优异性能。",
    "无机预涂板由于其出色的耐火性能、抗菌功能和环保特性，广泛应用于医院、实验室等场所。"
]


bm25_ef = BM25EmbeddingFunction(analyzer)
bm25_ef.fit(docs)


docs_embeddings = bm25_ef.encode_documents(docs)

query = '无机预涂板有耐火性吗?'

query_embeddings = bm25_ef.encode_queries([query])

client = MilvusClient(uri='test.db')

schema = client.create_schema(
    auto_id=True,
    enable_dynamic_fields=True,
)

schema.add_field(field_name="pk", datatype=DataType.VARCHAR, is_primary=True, max_length=100)
schema.add_field(field_name="sparse_vector", datatype=DataType.SPARSE_FLOAT_VECTOR)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=65535)

index_params = client.prepare_index_params()

client.create_collection(collection_name="test_sparse_vector", schema=schema)
index_params.add_index(
    field_name="sparse_vector",
    index_name="sparse_inverted_index",
    index_type="SPARSE_INVERTED_INDEX",
    metric_type="IP",
)

# Create index
client.create_index(collection_name="test_sparse_vector", index_params=index_params)

search_params = {
    "metric_type": "IP",
    "params": {}
}
for i in range(len(docs)):
    entity = {'sparse_vector': docs_embeddings[[i]], 'text':docs[i]}
    client.insert(collection_name="test_sparse_vector", data=entity)

results = client.search(collection_name="test_sparse_vector", data=query_embeddings[[0]], output_fields=['text'], search_params=search_params)
print(results)

yidasanqian · 2024-09-05T12:52:36Z

Documents are dynamically added to milvus and are more than 1 million in number, do I have to full fit all documents every time I execute a bm25 query?

wxywb · 2024-09-05T14:13:13Z

Although it is mathematically correct that BM25 should fit all inserted documents, a more practical approach is to save your parameters after fitting a large number of texts, and then load these saved parameters during query time to avoid refitting.

yidasanqian · 2024-09-06T07:22:37Z

These documents take up about 32 GB of memory. I need to load them all into memory, then execute fit, and finally call save, right? Do I need to do this process every time I add a document? Is there a way to incrementally update the parameters?

wxywb · 2024-09-06T07:39:39Z

yes, currently there is no incremental updates for bm25 and it is planned. Also Milvus will support native bm25, please stay tuned.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

invalid input for sparse float vector #35

invalid input for sparse float vector #35

yidasanqian commented Sep 3, 2024

yidasanqian commented Sep 4, 2024

wxywb commented Sep 4, 2024

yidasanqian commented Sep 5, 2024

wxywb commented Sep 5, 2024

yidasanqian commented Sep 6, 2024

wxywb commented Sep 6, 2024

invalid input for sparse float vector #35

invalid input for sparse float vector #35

Comments

yidasanqian commented Sep 3, 2024

yidasanqian commented Sep 4, 2024

wxywb commented Sep 4, 2024

yidasanqian commented Sep 5, 2024

wxywb commented Sep 5, 2024

yidasanqian commented Sep 6, 2024

wxywb commented Sep 6, 2024