Migrate datasets to build on top of torchdata datapipes #1494
🚀 Feature
Motivation
https://github.com/pytorch/data#why-composable-data-loading
User experience: TorchData datasets enable a new functional API, auto-sharding, and snapshotting support out of the box. They also support standard flow control such as batching, collation, shuffling, bucketing, and mapping/transformation with user-defined functions (UDFs) and transforms (see the sketch below).
Maintenance: By relying on TorchData, we no longer have to maintain low-level functionality such as downloading, extracting, caching, and file/stream parsing.
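As an illustration, here is a minimal sketch of how a dataset could be composed from datapipes. This is not the final torchtext implementation; the URL, the function name, and the (label, text) row format are placeholder assumptions.

```python
# Minimal sketch of a datapipe-backed dataset. The URL and row format below
# are illustrative assumptions, not the actual torchtext implementation.
from torchdata.datapipes.iter import IterableWrapper, HttpReader

URL = "https://example.com/ag_news_csv/train.csv"  # placeholder URL

def example_dataset(split="train"):
    # Wrap the download URL in a datapipe and stream the file over HTTP.
    url_dp = IterableWrapper([URL])
    # HttpReader yields (url, stream) pairs; parse_csv splits each line of the
    # stream into a list of CSV fields.
    data_dp = HttpReader(url_dp).parse_csv()
    # Map each row to the (label, text) tuple format used by torchtext datasets.
    return data_dp.map(lambda row: (int(row[0]), row[1]))
```

Because the result is an IterDataPipe, the flow-control operations mentioned above compose directly, for example `example_dataset().shuffle().sharding_filter().batch(8)`, before handing the pipeline to a DataLoader.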
Reference
Examples: https://github.com/facebookexternal/torchdata/tree/main/examples/text
TorchData: https://github.com/facebookexternal/torchdata
Backlog of datasets
- AG_NEWS migrate AG_NEWS to datapipes. #1498
- AmazonReviewFull migrate Amazon Review Full to datapipes. #1499
- AmazonReviewPolarity Migrating AmazonReviewPolarity to datapipes #1490
- DBpedia migrate DBPedia to datapipes. #1500
- SogouNews migrate SogouNews to datapipes. #1503
- YelpReviewFull migrate YelpReviewFull to datapipes. #1507
- YelpReviewPolarity migrate YelpReviewPolarity to datapipes. #1509
- YahooAnswers migrate YahooAnswers to datapipes. #1508
- CoNLL2000Chunking migrate CONLL 2000 to datapipes. #1515
- UDPOS migrate UDPOS to datapipes. #1535
- IWSLT2016 migrate IWSLT2016 to datapipes. #1545
- IWSLT2017 migrate IWSLT2017 to datapipes. #1547
- Multi30K migrate Multi30k to datapipes. #1536
- SQuAD1 migrate SQUAD1 to datapipes. #1513
- SQuAD2 Migrate squad2 to datapipes #1514
- PennTreebank Migrating PennTreebank to datapipes #1511
- WikiText103 Migrate WikiText103 to datapipes #1518
- WikiText2 Migrate WikiText2 to datapipes #1519
- EnWik9 Migrating EnWik9 to datapipes #1511 #1512
- IMDB Migrate IMDB to datapipes #1531
- SST2 Migrate SST2 from experimental to datasets folder #1538
- CC-100 add CC100 #1562
Contributing
Please leave a message below if you plan to work on a particular dataset (or datasets) to avoid duplication of effort, and please link the corresponding PRs.
cc: @Nayef211 , @abhinavarora , @erip , @ejguan , @VitalyFedyunin