feat: Add sparse vectors benchmark support for Qdrant #114

KShivendu · 2024-04-03T12:50:55Z

This PR does the following:

* Introduce a sparse dataset reader
* Refactor the base client and dataclasses to support sparse vectors
* Use the msmarco-sparse-1M dataset from https://github.com/qdrant/sparse-vectors-benchmark
* Add sparse vector benchmark support for Qdrant
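
A minimal sketch of what dataclasses covering both dense and sparse vectors could look like. The names `SparseVector` (with `indices` and `values`), `Query`, and `Record` appear in this PR; every other field here (`id`, `metadata`, `expected_result`) is an assumption for illustration and may not match the actual code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SparseVector:
    # Positions of the non-zero entries and their corresponding values.
    indices: List[int]
    values: List[float]


@dataclass
class Record:
    # Hypothetical fields: a record carries a dense vector, a sparse one, or neither.
    id: int
    vector: Optional[List[float]] = None
    sparse_vector: Optional[SparseVector] = None
    metadata: Optional[dict] = None


@dataclass
class Query:
    # sparse_vector defaults to None so dense-only readers need no changes.
    vector: Optional[List[float]] = None
    sparse_vector: Optional[SparseVector] = None
    expected_result: Optional[List[int]] = None


dense = Record(id=1, vector=[0.1, 0.2, 0.3])
sparse = Record(id=2, sparse_vector=SparseVector(indices=[3, 17], values=[0.5, 1.2]))
```

Plain dataclasses like these perform no validation, so a reader that produces only dense vectors can leave `sparse_vector` unset.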

TODO:

* Apply the same client refactors to engine classes (PR refactor: Update all engines to use Query and Record dataclasses #116)
* Test all the engines locally with sparse and non-sparse datasets
* Run continuously with CI and cross-check everything (Can only be done after merging. Read this)
* Allow running sparse configs only with sparse datasets
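
The last TODO item amounts to a compatibility check between a dataset and an engine config. A minimal sketch follows; the dict layout, the `type` and `sparse` keys, the helper name `is_compatible`, and all config names other than msmarco-sparse-1M are hypothetical and not taken from the repository.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified stand-ins for dataset and engine config entries.
datasets: List[Dict] = [
    {"name": "msmarco-sparse-1M", "type": "sparse"},
    {"name": "some-dense-dataset", "type": "dense"},
]
engine_configs: List[Dict] = [
    {"name": "engine-sparse", "sparse": True},
    {"name": "engine-dense", "sparse": False},
]


def is_compatible(dataset: Dict, engine_config: Dict) -> bool:
    """A sparse dataset pairs only with a sparse engine config, and vice versa."""
    dataset_is_sparse = dataset.get("type") == "sparse"
    engine_is_sparse = bool(engine_config.get("sparse", False))
    return dataset_is_sparse == engine_is_sparse


# Keep only the (dataset, engine config) pairs that make sense to run.
runnable: List[Tuple[str, str]] = [
    (d["name"], e["name"])
    for d in datasets
    for e in engine_configs
    if is_compatible(d, e)
]
```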

dataset_reader/base_reader.py

datasets/datasets.json

engine/base_client/search.py

engine/base_client/upload.py

agourlay · 2024-04-05T10:43:40Z

@KShivendu Are there any result files to look at already?

KShivendu · 2024-04-09T07:27:56Z

It looks like the CI benchmarks for sparse vectors aren't actually running yet, because CI uses the qdrant/vector-db-benchmark:latest image. This PR needs to be merged before it can be properly tested on CI.

However, I've tested this locally and it worked as expected, so we can merge once we are sure about the refactors introduced in this PR.

joein

Query in ann_h5_reader, and maybe in some other readers, requires sparse_vectors to be set + #117

* fix: remove scipy, read csr matrix manually
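
For context, a minimal sketch of reading rows out of a CSR matrix without scipy. It assumes the three standard CSR arrays (`indptr`, `indices`, `data`) are already in memory; the function name `iter_csr_rows` is hypothetical, and the on-disk format the benchmark actually reads is not covered here.

```python
from typing import Iterator, List, Sequence, Tuple


def iter_csr_rows(
    indptr: Sequence[int],
    indices: Sequence[int],
    data: Sequence[float],
) -> Iterator[Tuple[List[int], List[float]]]:
    """Yield (column indices, values) for each row of a CSR matrix."""
    # Row i occupies the half-open slice [indptr[i], indptr[i + 1]).
    for row in range(len(indptr) - 1):
        start, end = indptr[row], indptr[row + 1]
        yield list(indices[start:end]), list(data[start:end])


# A 3x5 matrix with an empty middle row:
# [[0,   1.5, 0, 0, 2.0],
#  [0,   0,   0, 0, 0  ],
#  [3.0, 0,   0, 0, 0  ]]
indptr = [0, 2, 2, 3]
indices = [1, 4, 0]
data = [1.5, 2.0, 3.0]

rows = list(iter_csr_rows(indptr, indices, data))
```

Each yielded pair maps directly onto the `indices` and `values` of one sparse vector, which is why a full matrix library is not strictly needed for this step.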

benchmark/dataset.py

dataset_reader/base_reader.py

run.py

engine/clients/qdrant/configure.py

run.py

joein · 2024-04-09T15:33:17Z

engine/clients/qdrant/upload.py

+        for point in batch:
+            vector = {}
+            if point.vector is not None:
+                vector[""] = point.vector


Why do we need to do this?

I thought that's the only way to do it. Using this now:

    if point.sparse_vector is None:
        vector = point.vector
    else:
        vector = {
            "sparse": SparseVector(
                indices=point.sparse_vector.indices,
                values=point.sparse_vector.values,
            )
        }

Would have been nice if we could keep it similar to search. But qdrant-client doesn't support this. Wdyt?

    if point.sparse_vector is None:
        vector = point.vector
    else:
        vector = NamedSparseVector(
            name="sparse",
            vector=SparseVector(
                indices=point.sparse_vector.indices,
                values=point.sparse_vector.values,
            ),
        )

engine/clients/qdrant/upload.py

engine/clients/qdrant/search.py

engine/base_client/upload.py

engine/clients/qdrant/search.py

engine/base_client/upload.py

* refactoring: refactor sparse vector support
* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

KShivendu · 2024-04-12T11:33:32Z

I prefer not to bring changes, if they are not necessary
However, if you have already checked that it does not break anything and does not affect the performance, then I assume I don't have objections

I haven't benchmarked this, but imo lightweight dataclasses shouldn't cause significant perf degradation. It's not even pydantic, so no validations are involved.

Regarding the "breaking anything" part, I've tested Qdrant with sparse and dense datasets locally and they worked. I did the refactor for the other engines and also introduced a Ruff linter pre-commit hook in this PR.
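
One way to sanity-check the performance claim is a micro-benchmark along these lines. This is a sketch, not a measurement from the PR: the `Record` shape is hypothetical, and timings vary by machine, so no expected numbers are given.

```python
import timeit
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:  # hypothetical shape, for illustration only
    id: int
    vector: Optional[List[float]] = None


vector = [0.0] * 128


def as_tuple():
    # Baseline: pass the id and vector around as a bare tuple.
    return (1, vector)


def as_dataclass():
    # Candidate: wrap the same data in a dataclass instance.
    return Record(id=1, vector=vector)


n = 100_000
tuple_s = timeit.timeit(as_tuple, number=n)
dataclass_s = timeit.timeit(as_dataclass, number=n)
print(f"tuple:     {tuple_s / n * 1e9:.0f} ns/item")
print(f"dataclass: {dataclass_s / n * 1e9:.0f} ns/item")
```

If the per-item difference is small relative to the latency of a single upload or search request, the refactor is unlikely to affect benchmark results.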

engine/clients/qdrant/upload.py

* refactor: Update all engines to use Query and Record dataclasses
* feat: Add ruff in pre-commit hooks
* fix: Type mismatches
* fix: Redis search client types and var names
* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: Type issues detected by linter
* fix: iter_batches func type
* refactor: knn_conditions should be class level constant

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

generall

Let's see how it works

feat: Add sparse vectors benchmark support in Qdrant

d14e5fe

KShivendu changed the title ~~feat: Add sparse vectors benchmark support in Qdrant~~ feat: Add sparse vectors benchmark support for Qdrant Apr 3, 2024

KShivendu added 2 commits April 3, 2024 18:33

fix: Self review

b9be8bb

feat: Add sparse dataset for CI benchmarks

7af636a

KShivendu requested review from agourlay, generall, joein and kacperlukawski April 3, 2024 13:38

generall reviewed Apr 3, 2024

dataset_reader/base_reader.py Outdated

agourlay reviewed Apr 4, 2024

datasets/datasets.json

KShivendu commented Apr 4, 2024

engine/base_client/search.py

engine/base_client/upload.py

KShivendu added 6 commits April 8, 2024 16:47

feat: Introduce SparseVector class

ce902b2

feat: Disallow sparse vector dataset being run with non sparse vector…

50ca05f

… engine configs

feat: use different engine config to run sparse vector benchmarks

feb3323

fix: use different engine config to run sparse vector benchmarks

2a653f7

feat: Optimize CI benchmarks workflow

9d0fc40

feat: Add 1M sparse dataset

218c775

KShivendu requested review from agourlay and generall April 9, 2024 07:38

KShivendu mentioned this pull request Apr 9, 2024

refactor: Update all engines to use Query and Record dataclasses #116

Merged

joein requested changes Apr 9, 2024

joein and others added 2 commits April 9, 2024 15:17

fix: remove scipy, read csr matrix manually (#117)

36bcfaa

* fix: remove scipy, read csr matrix manually

fix: Dataset query reader should have sparse_vector=None by default

4a7f09d

joein requested changes Apr 9, 2024

refactor: Changes based on feedback

074d06c

KShivendu requested a review from joein April 10, 2024 13:42

joein requested changes Apr 10, 2024

engine/clients/qdrant/search.py Outdated

engine/base_client/upload.py

feat: Use pydantic construct

6be36d2

KShivendu requested a review from joein April 16, 2024 09:32

KShivendu mentioned this pull request Apr 16, 2024

Median based TopK for sparse vectors scoring qdrant/qdrant#4037

Merged

joein approved these changes Apr 16, 2024

joein requested changes Apr 16, 2024

engine/clients/qdrant/upload.py

KShivendu and others added 7 commits April 16, 2024 23:14

Merge branch 'master' into feat/sparse-ci-benchmarks

750ad61

[pre-commit.ci] auto fixes from pre-commit.com hooks

931d9e3

for more information, see https://pre-commit.ci

fix: Type issue

6aa5ee8

fix: Allow python 3.8 since scipy is now removed

542fd33

fix: Add missing redis-m-16-ef-128 config

b11d41f

fix: redis container port

a30f25b

KShivendu requested a review from joein April 17, 2024 06:35

fix linter

4091b78

generall approved these changes Apr 17, 2024

joein approved these changes Apr 17, 2024

KShivendu merged commit 5343849 into master Apr 17, 2024

KShivendu deleted the feat/sparse-ci-benchmarks branch April 17, 2024 09:18
