
Commit 8212ecd

Add dataset_validation tests
Add a test to validate that all public datasets are valid. These tests are added to their own directory as they can be slow to run and need a large amount of RAM to hold each dataset.

The first test added (test_all_datasets_valid) performs some basic validation of each dataset:

- Does the number of vectors in the data files match what the metadata says?
- Are there any duplicate ids?

It only checks datasets with 2M or fewer vectors, as larger ones require more than 32GB of RAM to load and validate. This currently means 2 datasets are skipped:

* Skipping dataset 'ANN_DEEP1B_d96_angular' which is larger than 2,000,000 vectors (has 9,990,000)
* Skipping dataset 'msmarco-v1-bm25-allMiniLML6V2' which is larger than 2,000,000 vectors (has 8,841,823)
1 parent b775c7a commit 8212ecd

File tree

3 files changed: +38 -0 lines changed


tests/dataset_validation/README.md

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
## Dataset Validation Tests

This directory contains tests which perform validation on
datasets. These can be costly as they require downloading and checking
data in each dataset, so these are located in a separate directory
from the main system tests.
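
Because these validation tests live outside the main system test tree, they have to be selected explicitly when you want to run them. A minimal sketch of such an invocation, assuming pytest is launched from the repository root (the exact options used for these tests are not shown in this commit):

    # Hypothetical runner: select only the dataset validation tests.
    import sys
    import pytest

    if __name__ == "__main__":
        # "-v" prints one line per parametrized dataset; the path is assumed
        # to be relative to the repository root.
        sys.exit(pytest.main(["-v", "tests/dataset_validation"]))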

tests/dataset_validation/__init__.py

Whitespace-only changes.
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
import pytest
import pinecone_datasets


def pytest_generate_tests(metafunc):
    # Discover the set of datasets in the public repo, populating the
    # 'dataset' parameter with them all.
    metafunc.parametrize("dataset", pinecone_datasets.list_datasets())


def test_all_datasets_valid(dataset):
    """For the given dataset, check we can successfully load it from cloud
    storage (i.e. metadata checks pass and necessary files are present)."""
    ds = pinecone_datasets.load_dataset(dataset)
    # Ideally we would check every dataset, but some are _very_ big and OOM-kill
    # a typical VM.
    if ds.metadata.documents > 2_000_000:
        pytest.skip(
            f"Skipping dataset '{dataset}' which is larger than 2,000,000 vectors (has {ds.metadata.documents:,})"
        )
    df = ds.documents
    assert ds.metadata.documents == len(
        df
    ), f"Count of vectors found in Dataset file ({len(df)}) does not match count in metadata ({ds.metadata.documents})"
    duplicates = df[df["id"].duplicated()]
    num_duplicates = len(duplicates)
    if num_duplicates:
        print("Summary of duplicate IDs in vectors:")
        print(duplicates)
    assert (
        num_duplicates == 0
    ), f"Not all vector ids are unique - found {num_duplicates} duplicates"
