
Provide tools to shard batches across different workers in PyTorch DataLoader #176


Description

@arbennett

Is your feature request related to a problem?

I have a common workflow where I use xbatcher in conjunction with torchdata.datapipes to load data, which works really well for chaining together transformations. But when the resulting datapipe is put into torch.utils.data.DataLoader with num_workers > 1, each worker creates its own copy of the full dataset and iterates over it, so a single "epoch" is actually num_workers passes over the full data.
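
For reference, here is a minimal sketch of the pattern I mean; the dataset, dimension names, and sizes are placeholders, and `preprocess` just stands in for an arbitrary transformation chain:

```python
import xarray as xr
import xbatcher
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

# Placeholder dataset; in practice this is a much larger xarray Dataset.
ds = xr.tutorial.open_dataset("air_temperature")

# Generate fixed-size windows over the time dimension with xbatcher.
bgen = xbatcher.BatchGenerator(ds, input_dims={"time": 10})

def preprocess(batch):
    # Stand-in for the chained transformations; just loads the window.
    return batch.load()

dp = IterableWrapper(bgen).map(preprocess)

# With num_workers > 1, every worker replays the whole datapipe, so one
# "epoch" yields len(bgen) * num_workers batches instead of len(bgen).
loader = DataLoader(dp, batch_size=None, num_workers=4)
```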

Describe the solution you'd like

Ideally, xbatcher would expose a hook that can be passed as the worker_init_fn to torch.utils.data.DataLoader so that each worker handles only its own unique portion of the dataset.
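
To illustrate, here is a rough sketch of the kind of hook I have in mind, following the per-worker sharding pattern from the PyTorch IterableDataset docs. The `XbatcherIterable` wrapper and its `start`/`end` attributes are hypothetical (not part of xbatcher), and the sketch assumes `BatchGenerator` supports `len()` and integer indexing; `bgen` is the generator from the sketch above:

```python
import math
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class XbatcherIterable(IterableDataset):
    """Hypothetical wrapper that yields the [start, end) window of
    batches from an xbatcher BatchGenerator."""

    def __init__(self, bgen):
        self.bgen = bgen
        self.start = 0
        self.end = len(bgen)

    def __iter__(self):
        for i in range(self.start, self.end):
            yield self.bgen[i]

def shard_by_worker(worker_id):
    # Runs in each worker process after it receives its copy of the
    # dataset; shrink that copy's window so the workers jointly cover
    # every batch exactly once.
    info = get_worker_info()
    ds = info.dataset
    per_worker = math.ceil((ds.end - ds.start) / info.num_workers)
    start = ds.start + worker_id * per_worker
    ds.start, ds.end = start, min(start + per_worker, ds.end)

loader = DataLoader(
    XbatcherIterable(bgen), batch_size=None,
    num_workers=4, worker_init_fn=shard_by_worker,
)
```

For the datapipe route specifically, I believe torchdata's `sharding_filter()` already performs this kind of per-worker splitting when the pipe is consumed by a DataLoader, so it might serve as a building block.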

Describe alternatives you've considered

No response

Additional context

No response

Labels

question: Further information is requested
wontfix: This will not be worked on