
Conversation

@daverigby
Contributor

Problem

We have at least one dataset with inconsistencies: langchain-python-docs-text-embedding-ada-002 has an extra, duplicated .parquet file, which means the dataset ends up with 2x the number of vectors it should have.

Solution

Add a test to validate that all public datasets are valid. The tests are added in their own directory as they can be slow to run and need a large amount of RAM to hold each dataset.

The first test added (test_all_datasets_valid) performs some basic validation of each dataset:

  • Does the number of vectors in the data files match what the metadata says?

  • Are there any duplicate ids?

This only checks datasets with 2M or fewer vectors, as larger ones require more than 32GB of RAM to load and validate. This currently means 2 datasets are skipped:

  • Skipping dataset 'ANN_DEEP1B_d96_angular which is larger than 2,000,000 vectors (has 9,990,000)

  • Skipping dataset 'msmarco-v1-bm25-allMiniLML6V2 which is larger than 2,000,000 vectors (has 8,841,823)
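For illustration, a minimal sketch of the kind of check described above. It assumes the public pinecone_datasets helpers (list_datasets, load_dataset) and a metadata.documents count field, and is not the exact test added in this PR:

```python
# Sketch only - not the exact test added in this PR. The helper and
# attribute names (list_datasets, load_dataset, metadata.documents) are
# assumptions based on the public pinecone_datasets API.
import pytest
from pinecone_datasets import list_datasets, load_dataset

MAX_VECTORS = 2_000_000  # larger datasets need more than 32GB of RAM


@pytest.mark.parametrize("name", list_datasets())
def test_all_datasets_valid(name):
    dataset = load_dataset(name)
    expected = dataset.metadata.documents
    if expected > MAX_VECTORS:
        pytest.skip(
            f"Skipping dataset '{name}' which is larger than "
            f"{MAX_VECTORS:,} vectors (has {expected:,})"
        )

    # Assumes documents are loaded lazily, so skipped datasets never pay
    # the cost of the load below.
    docs = dataset.documents  # pandas DataFrame with an 'id' column

    # 1. Does the number of vectors in the data files match the metadata?
    assert len(docs) == expected, (
        f"Count of vectors found in Dataset file ({len(docs)}) does not "
        f"match count in metadata ({expected})"
    )

    # 2. Are there any duplicate ids?
    duplicates = docs["id"].duplicated().sum()
    assert duplicates == 0, (
        f"Not all vector ids are unique - found {duplicates} duplicates"
    )
```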

Type of Change

  • None of the above: new tests

@daverigby
Contributor Author

Running these tests on a 32GB GCP VM (poetry run pytest tests/dataset_validation) takes around 3 minutes and gives the following results:

FAILED tests/dataset_validation/test_validate_public_datasets.py::test_all_datasets_valid[langchain-python-docs-text-embedding-ada-002] - AssertionError: Count of vectors found in Dataset file (6952) does not match count in metadata (3476)
FAILED tests/dataset_validation/test_validate_public_datasets.py::test_all_datasets_valid[movielens-user-ratings] - AssertionError: Not all vector ids are unique - found 960313 duplicates
SKIPPED [1] tests/dataset_validation/test_validate_public_datasets.py:18: Skipping dataset 'ANN_DEEP1B_d96_angular which is larger than 2,000,000 vectors (has 9,990,000)
SKIPPED [1] tests/dataset_validation/test_validate_public_datasets.py:18: Skipping dataset 'msmarco-v1-bm25-allMiniLML6V2 which is larger than 2,000,000 vectors (has 8,841,823)

The first failure (langchain-python-docs-text-embedding-ada-002) was previously known and was the motivation for adding this test.

The second failure (movielens-user-ratings) is a new issue uncovered by this test - there are a large number of vectors with duplicate ids, but their vector data is different. From looking at the metadata, I believe the id generation is not sufficiently "unique" - the id appears to be taken from the movie_id field, however there are multiple vectors (movie reviews) for a single movie. I believe we would need to combine movie_id with the user_id field to generate a truly unique id.
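For example (purely illustrative; the movie_id and user_id column names come from the comment above, not from inspecting the actual generation code), a composite id would avoid the collisions:

```python
import pandas as pd

# Hypothetical illustration of the duplicate-id problem and a possible fix.
# Column names (movie_id, user_id) are assumptions based on the discussion
# above; the real dataset's id-generation code may differ.
ratings = pd.DataFrame({"movie_id": [1, 1, 2], "user_id": [10, 11, 10]})

# Current scheme: id taken from movie_id alone -> two vectors share id "1".
ratings["id"] = ratings["movie_id"].astype(str)
assert ratings["id"].duplicated().sum() == 1

# Combining movie_id and user_id gives one id per (movie, user) rating.
ratings["id"] = (
    ratings["movie_id"].astype(str) + "-" + ratings["user_id"].astype(str)
)
assert ratings["id"].is_unique
```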

@daverigby
Contributor Author

CC @jamescalam ^ It looks like you added the original dataset for movielens-user-ratings at https://huggingface.co/datasets/pinecone/movielens-recent-ratings - any comments on the issue above?

@daverigby requested a review from miararoy on February 9, 2024 at 17:03
@daverigby force-pushed the daver/validate_public_dataset branch from 8212ecd to 1b750f3 on February 9, 2024 at 17:13
