Add dataset_validation tests #44
Open
Problem
We have at least one dataset which has inconsistencies: langchain-python-docs-text-embedding-ada-002 has an extra, duplicated .parquet file, which means the dataset ends up with 2x the number of vectors it should have.
Solution
Add tests to validate that all public datasets are valid. These are added in their own directory, as they can be slow to run and need a large amount of RAM to hold each dataset.
The first test added (test_all_datasets_valid) performs some basic validation of each dataset (a rough sketch follows the list below):
- Does the number of vectors in the data files match what the metadata says?
- Are there any duplicate ids?
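As a minimal sketch of the kind of check involved (not the actual test code), assuming each dataset is available locally as a directory with a metadata.json declaring the expected vector count and a documents/ folder of .parquet files; the layout, the `documents` metadata field, and the `validate_dataset` helper are all illustrative assumptions:

```python
import json
from pathlib import Path

import pandas as pd


def validate_dataset(dataset_dir: Path) -> None:
    """Basic consistency checks for a single dataset directory (layout assumed)."""
    metadata = json.loads((dataset_dir / "metadata.json").read_text())
    parquet_files = sorted((dataset_dir / "documents").glob("*.parquet"))

    # Only the id column is needed for these checks, which keeps memory usage down.
    ids = pd.concat(pd.read_parquet(p, columns=["id"]) for p in parquet_files)

    # 1. Does the number of vectors in the data files match what the metadata says?
    expected = metadata["documents"]  # field name is an assumption
    assert len(ids) == expected, f"expected {expected} vectors, found {len(ids)}"

    # 2. Are there any duplicate ids?
    dup_count = int(ids["id"].duplicated().sum())
    assert dup_count == 0, f"found {dup_count} duplicate ids"
```

An extra duplicated .parquet file (as in the langchain-python-docs-text-embedding-ada-002 case) would fail both checks: the vector count doubles and every id appears twice.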
This only checks datasets with 2M or fewer vectors, as larger ones require more than 32GB of RAM to load and validate. This currently means 2 datasets are skipped:
Skipping dataset 'ANN_DEEP1B_d96_angular' which is larger than 2,000,000 vectors (has 9,990,000)
Skipping dataset 'msmarco-v1-bm25-allMiniLML6V2' which is larger than 2,000,000 vectors (has 8,841,823)
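The size guard that produces these messages could look roughly like the sketch below; the function name and the way the vector count is obtained are assumptions, only the 2,000,000 threshold and the message format come from the description above:

```python
MAX_VECTORS = 2_000_000  # datasets above this need more than 32GB of RAM to load


def should_validate(name: str, num_vectors: int) -> bool:
    """Return True if the dataset is small enough to be loaded and validated."""
    if num_vectors > MAX_VECTORS:
        print(
            f"Skipping dataset '{name}' which is larger than "
            f"{MAX_VECTORS:,} vectors (has {num_vectors:,})"
        )
        return False
    return True
```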
Type of Change