feat: Add sparse vectors benchmark support for Qdrant #114
Conversation
@KShivendu Are there any result files to have a look at already?
Looks like the CI benchmarks for sparse vectors aren't actually running for now because we use However, I've tested this locally and it worked as expected. So we can merge once we are sure about the refactors introduced in this PR.
joein
left a comment
Query in ann_h5_reader and maybe in some other readers require sparse_vectors to be set + #117
* fix: remove scipy, read csr matrix manually
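For readers unfamiliar with the CSR layout that the commit above refers to, here is a minimal sketch of splitting CSR arrays into per-row sparse vectors without scipy. The function name `csr_rows` and the in-memory list inputs are assumptions for illustration; the actual reader in this PR and its on-disk format are not shown in the thread.

```python
from typing import List, Sequence, Tuple


def csr_rows(
    indptr: Sequence[int],
    indices: Sequence[int],
    data: Sequence[float],
) -> List[Tuple[List[int], List[float]]]:
    """Split CSR arrays into one (indices, values) pair per row.

    Hypothetical helper: row i owns the slice indptr[i]:indptr[i + 1]
    of both `indices` and `data`.
    """
    if len(indices) != len(data):
        raise ValueError("indices and data must have the same length")
    if len(indptr) == 0 or indptr[-1] != len(data):
        raise ValueError("indptr must end at the number of stored values")

    rows = []
    for start, end in zip(indptr[:-1], indptr[1:]):
        rows.append((list(indices[start:end]), list(data[start:end])))
    return rows


# A 3 x 5 matrix with an empty middle row:
# [[0.0, 1.5, 0.0, 0.0, 2.0],
#  [0.0, 0.0, 0.0, 0.0, 0.0],
#  [3.0, 0.0, 0.0, 0.5, 0.0]]
indptr = [0, 2, 2, 4]
indices = [1, 4, 0, 3]
data = [1.5, 2.0, 3.0, 0.5]

rows = csr_rows(indptr, indices, data)
```

Each resulting pair maps directly onto the `indices` and `values` fields of a sparse vector, so no dense matrix is ever materialized.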
engine/clients/qdrant/upload.py
Outdated
    for point in batch:
        vector = {}
        if point.vector is not None:
            vector[""] = point.vector
why do we need to do it?
I thought that's the only way to do it. Using this now:

    if point.sparse_vector is None:
        vector = point.vector
    else:
        vector = {
            "sparse": SparseVector(
                indices=point.sparse_vector.indices,
                values=point.sparse_vector.values,
            )
        }

Would have been nice if we could keep it similar to search. But qdrant-client doesn't support this. Wdyt?

    if point.sparse_vector is None:
        vector = point.vector
    else:
        vector = NamedSparseVector(
            name="sparse",
            vector=SparseVector(
                indices=point.sparse_vector.indices,
                values=point.sparse_vector.values,
            ),
        )

* refactoring: refactor sparse vector support
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
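The dense-or-sparse selection discussed in the review thread above can be tried in isolation with the sketch below. It is not the PR's code: `SparseVector` here is a local stand-in for the qdrant-client model of the same name, `Record` is a hypothetical shape (only `vector` and `sparse_vector` appear in the thread; `id` is assumed), and `build_vector` is an illustrative helper name.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class SparseVector:
    """Local stand-in for qdrant-client's SparseVector model."""
    indices: List[int]
    values: List[float]


@dataclass
class Record:
    """Hypothetical record shape; `id` is an assumption."""
    id: int
    vector: Optional[List[float]] = None
    sparse_vector: Optional[SparseVector] = None


def build_vector(
    point: Record,
) -> Union[List[float], Dict[str, SparseVector], None]:
    """Return the dense vector as-is, or wrap the sparse one under a name."""
    if point.sparse_vector is None:
        return point.vector
    return {
        "sparse": SparseVector(
            indices=point.sparse_vector.indices,
            values=point.sparse_vector.values,
        )
    }


dense = build_vector(Record(id=1, vector=[0.1, 0.2]))
sparse = build_vector(
    Record(id=2, sparse_vector=SparseVector(indices=[3, 7], values=[0.5, 1.0]))
)
```

Keeping the branch in one helper means the upload loop does not need the empty-string key from the outdated diff; the name `"sparse"` matches the one used in the thread.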
I haven't tested this but imo lightweight dataclasses shouldn't cause significant perf degradations. It's not even pydantic, so no validations are involved. Regarding the breaking anything part, I've tested Qdrant sparse and dense datasets locally and they worked. I did the refactor for other engines and also introduced Ruff linter pre-commit hook in this PR.
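To illustrate the "no validations" point, here is a small sketch showing that a standard-library dataclass stores whatever it is given and does not check type hints at runtime. The `Query` fields below are assumptions for illustration and are not taken from the PR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Query:
    """Hypothetical query shape; field names are assumptions."""
    vector: Optional[List[float]] = None
    expected_result: Optional[List[int]] = None


# Type hints are not enforced: the generated __init__ only assigns
# attributes, so a mistyped value is stored without any check.
ok = Query(vector=[0.1, 0.2], expected_result=[4, 8])
unchecked = Query(vector="not a list of floats")
```

The flip side is that type mismatches are only caught by static tooling, which is where a linter or type checker in pre-commit helps.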
* refactor: Update all engines to use Query and Record dataclasses
* feat: Add ruff in pre-commit hooks
* fix: Type mismatches
* fix: Redis search client types and var names
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* fix: Type issues detected by linter
* fix: iter_batches func type
* refactor: knn_conditions should be class level constant

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
generall
left a comment
Let's see how it works
This PR does the following:
* msmarco-sparse-1M dataset used in https://github.com/qdrant/sparse-vectors-benchmark

TODO: