
Conversation

@daverigby
Contributor

Problem

We have at least one dataset with inconsistencies: langchain-python-docs-text-embedding-ada-002 has an extra, duplicated .parquet file, which means the dataset ends up with 2x the number of vectors it should have.

Solution

Add a test to validate that all public datasets are valid. The tests are added in their own directory as they can be slow to run and need a large amount of RAM to hold each dataset.

The first test added (test_all_datasets_valid) performs some basic validation of each dataset:

  • Does the number of vectors in the data files match what the metadata says?

  • Are there any duplicate ids?

This only checks datasets with 2M or fewer vectors, as larger ones require more than 32GB of RAM to load and validate. This currently means 2 datasets are skipped:

  • Skipping dataset 'ANN_DEEP1B_d96_angular which is larger than 2,000,000 vectors (has 9,990,000)

  • Skipping dataset 'msmarco-v1-bm25-allMiniLML6V2 which is larger than 2,000,000 vectors (has 8,841,823)
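For illustration, a minimal sketch of the kind of check described above. It assumes the public pinecone_datasets helpers (list_datasets, load_dataset) and a metadata.documents count field, and is not the exact test added in this PR:

```python
# Sketch only - not the exact test added in this PR. The helper and
# attribute names (list_datasets, load_dataset, metadata.documents) are
# assumptions based on the public pinecone_datasets API.
import pytest
from pinecone_datasets import list_datasets, load_dataset

MAX_VECTORS = 2_000_000  # larger datasets need more than 32GB of RAM


@pytest.mark.parametrize("name", list_datasets())
def test_all_datasets_valid(name):
    dataset = load_dataset(name)
    expected = dataset.metadata.documents
    if expected > MAX_VECTORS:
        pytest.skip(
            f"Skipping dataset '{name}' which is larger than "
            f"{MAX_VECTORS:,} vectors (has {expected:,})"
        )

    # Assumes documents are loaded lazily, so skipped datasets never pay
    # the cost of the load below.
    docs = dataset.documents  # pandas DataFrame with an 'id' column

    # 1. Does the number of vectors in the data files match the metadata?
    assert len(docs) == expected, (
        f"Count of vectors found in Dataset file ({len(docs)}) does not "
        f"match count in metadata ({expected})"
    )

    # 2. Are there any duplicate ids?
    duplicates = docs["id"].duplicated().sum()
    assert duplicates == 0, (
        f"Not all vector ids are unique - found {duplicates} duplicates"
    )
```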

Type of Change

  • None of the above: new tests

@daverigby
Contributor Author

Running these tests on a 32GB GCP VM (poetry run pytest tests/dataset_validation) takes around 3 minutes and gives the following results:

FAILED tests/dataset_validation/test_validate_public_datasets.py::test_all_datasets_valid[langchain-python-docs-text-embedding-ada-002] - AssertionError: Count of vectors found in Dataset file (6952) does not match count in metadata (3476)
FAILED tests/dataset_validation/test_validate_public_datasets.py::test_all_datasets_valid[movielens-user-ratings] - AssertionError: Not all vector ids are unique - found 960313 duplicates
SKIPPED [1] tests/dataset_validation/test_validate_public_datasets.py:18: Skipping dataset 'ANN_DEEP1B_d96_angular which is larger than 2,000,000 vectors (has 9,990,000)
SKIPPED [1] tests/dataset_validation/test_validate_public_datasets.py:18: Skipping dataset 'msmarco-v1-bm25-allMiniLML6V2 which is larger than 2,000,000 vectors (has 8,841,823)

The first failure (langchain-python-docs-text-embedding-ada-002) was previously known and was the motivation for adding this test.

The second failure (movielens-user-ratings) is a new issue uncovered by this test - there are a large number of vectors with duplicate ids, but their vector data is different. From looking at the metadata, I believe the id generation is not sufficiently "unique" - the id appears to be taken from the movie_id field, however there are multiple vectors (movie reviews) for a single movie. I believe we would need to combine movie_id with the user_id field to generate a truly unique id.
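For example (purely illustrative; the movie_id and user_id column names come from the comment above, not from inspecting the actual generation code), a composite id would avoid the collisions:

```python
import pandas as pd

# Hypothetical illustration of the duplicate-id problem and a possible fix.
# Column names (movie_id, user_id) are assumptions based on the discussion
# above; the real dataset's id-generation code may differ.
ratings = pd.DataFrame({"movie_id": [1, 1, 2], "user_id": [10, 11, 10]})

# Current scheme: id taken from movie_id alone -> two vectors share id "1".
ratings["id"] = ratings["movie_id"].astype(str)
assert ratings["id"].duplicated().sum() == 1

# Combining movie_id and user_id gives one id per (movie, user) rating.
ratings["id"] = (
    ratings["movie_id"].astype(str) + "-" + ratings["user_id"].astype(str)
)
assert ratings["id"].is_unique
```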

@daverigby
Contributor Author

CC @jamescalam ^ It looks like you added the original dataset for movielens-user-ratings at https://huggingface.co/datasets/pinecone/movielens-recent-ratings - any comments on the issue above?

@daverigby requested a review from miararoy on February 9, 2024 at 17:03
@daverigby force-pushed the daver/validate_public_dataset branch from 8212ecd to 1b750f3 on February 9, 2024 at 17:13
