This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Migrate datasets to build on top of torchdata datapipes #1494

@parmeet

🚀 Feature

Motivation

https://github.com/pytorch/data#why-composable-data-loading

User experience: TorchData datasets enable a new functional API, auto-sharding, and snapshotting support out of the box. They also enable standard flow control such as batching, collation, shuffling, bucketing, and mapping/transformation using user-defined functions and transforms (UDFs).

Maintenance: By relying on TorchData, we no longer have to maintain low-level functionality like downloading, extracting, caching, file/stream parsing, etc.
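
To make the functional, chainable style mentioned above more concrete, here is a minimal pure-Python sketch. It is not the TorchData API: the `Pipe` class, its `from_iterable`, `map`, and `batch` methods, the `parse` function, and the sample lines are all hypothetical and only illustrate how lazy stages could be composed with a user-defined function.

```python
from typing import Any, Callable, Iterable, Iterator, List


class Pipe:
    """Hypothetical stand-in for an iterable datapipe (NOT a TorchData class)."""

    def __init__(self, make_iter: Callable[[], Iterator[Any]]):
        # Store a factory so the pipe can be iterated more than once.
        self._make_iter = make_iter

    def __iter__(self) -> Iterator[Any]:
        return self._make_iter()

    @classmethod
    def from_iterable(cls, items: Iterable[Any]) -> "Pipe":
        return cls(lambda: iter(items))

    def map(self, fn: Callable[[Any], Any]) -> "Pipe":
        # Lazily apply a user-defined function (UDF) to every sample.
        return Pipe(lambda: (fn(x) for x in self))

    def batch(self, size: int) -> "Pipe":
        # Lazily group consecutive samples into lists of at most `size`.
        def gen() -> Iterator[List[Any]]:
            buf: List[Any] = []
            for x in self:
                buf.append(x)
                if len(buf) == size:
                    yield buf
                    buf = []
            if buf:  # emit the final, possibly smaller, batch
                yield buf

        return Pipe(gen)


def parse(line: str):
    # Hypothetical UDF: split "label,text" into (int label, token list).
    label, text = line.split(",", 1)
    return int(label), text.split()


# Made-up sample data, for illustration only.
lines = ["1,good movie", "0,bad plot", "1,great cast"]
pipe = Pipe.from_iterable(lines).map(parse).batch(2)
batches = list(pipe)
```

Each stage returns a new pipe, so nothing is read or transformed until the final pipe is iterated. That laziness is what lets stages such as parsing, mapping, and batching be chained in any order without materializing intermediate results.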

Reference
Examples: https://github.com/facebookexternal/torchdata/tree/main/examples/text
TorchData: https://github.com/facebookexternal/torchdata

Backlog of datasets

Contributing

Please leave a message below if you plan to work on particular dataset(s) to avoid duplicated effort, and link to the corresponding PRs.

cc: @Nayef211, @abhinavarora, @erip, @ejguan, @VitalyFedyunin
