
Commit 1b750f3

Add dataset_validation tests
Add a test that checks all public datasets are valid. These are added to their own directory as they can be slow to run and need a large amount of RAM to hold each dataset.

The first test added (test_all_datasets_valid) performs some basic validation of each dataset:

- Does the number of vectors in the data files match what the metadata says?
- Are there any duplicate ids?

This only checks datasets with 2M or fewer vectors, as larger ones require more than 32 GB of RAM to load and validate. This currently means 2 datasets are skipped:

* Skipping dataset 'ANN_DEEP1B_d96_angular' which is larger than 2,000,000 vectors (has 9,990,000)
* Skipping dataset 'msmarco-v1-bm25-allMiniLML6V2' which is larger than 2,000,000 vectors (has 8,841,823)
1 parent b775c7a commit 1b750f3

File tree

4 files changed (+44, -0 lines)


.gitignore

Lines changed: 5 additions & 0 deletions
@@ -9,3 +9,8 @@ scratchpad.ipynb
 .pytest_cache/
 .coverage
 poetry.lock
+
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class

tests/dataset_validation/README.md

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+## Dataset Validation Tests
+
+This directory contains tests which perform validation on
+datasets. These can be costly as they require downloading and checking
+data in each dataset, so these are located in a separate directory
+from the main system tests.
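
A usage sketch (assuming pytest is installed and the repository root is the current working directory): because these tests live in their own directory, they can be run on their own, for example programmatically via pytest's public API rather than the usual command line.

# Run only the dataset-validation tests; equivalent to invoking
# `pytest tests/dataset_validation/` from the command line.
import pytest

exit_code = pytest.main(["tests/dataset_validation/"])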

tests/dataset_validation/__init__.py

Whitespace-only changes.
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+import pytest
+import pinecone_datasets
+
+
+def pytest_generate_tests(metafunc):
+    # Discover the set of datasets in the public repo, populating the
+    # 'dataset' parameter with them all.
+    metafunc.parametrize("dataset", pinecone_datasets.list_datasets())
+
+
+def test_all_datasets_valid(dataset):
+    """For the given dataset, check we can successfully load it from cloud
+    storage (i.e. metadata checks pass and necessary files are present)."""
+    ds = pinecone_datasets.load_dataset(dataset)
+    # Ideally we would check all datasets for this, but some are _very_ big
+    # and would OOM-kill a typical VM.
+    if ds.metadata.documents > 2_000_000:
+        pytest.skip(
+            f"Skipping dataset '{dataset}' which is larger than 2,000,000 vectors (has {ds.metadata.documents:,})"
+        )
+    df = ds.documents
+    duplicates = df[df["id"].duplicated()]
+    num_duplicates = len(duplicates)
+    if num_duplicates:
+        print("Summary of duplicate IDs in vectors:")
+        print(duplicates)
+    assert (
+        num_duplicates == 0
+    ), f"Not all vector ids are unique - found {len(duplicates)} duplicates out of {len(df)} total vectors"
+
+    assert ds.metadata.documents == len(
+        df
+    ), f"Count of vectors found in Dataset file ({len(ds.documents)}) does not match count in metadata ({ds.metadata.documents})"
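
For reference, the same checks can be sketched outside the test harness against a single dataset, using only the APIs exercised by the test above (assuming pinecone_datasets is installed; the dataset is picked arbitrarily from the public list purely for illustration).

import pinecone_datasets

# Pick any public dataset; the first entry is used only as an example.
name = pinecone_datasets.list_datasets()[0]
ds = pinecone_datasets.load_dataset(name)

df = ds.documents
# Same invariants as the test: unique ids, and a vector count matching the metadata.
assert not df["id"].duplicated().any(), "duplicate vector ids found"
assert ds.metadata.documents == len(df), "metadata count does not match data files"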
