Consolidate collective functions #7534

@ananthsub

Description

🚀 Feature

Lightning should offer a central place to use the collective functions provided here: https://pytorch.org/docs/stable/distributed.html#collective-functions

Motivation

LightningModule code is usually agnostic to the device it's running on and to whether it's running in a distributed training environment. However, there are times when the module does need to rely on collective functions.

In Lightning, these functions are currently offered in many places.

Some of these implementations call each other, and the dependencies between them aren't clear, so it is confusing for users which one to go through.

Pitch

  1. Offer these utilities in a central place, pytorch_lightning/utilities/collectives.py: barrier, all_gather, broadcast, etc.

These should be very thin wrappers over the PyTorch distributed functions, checking whether torch.distributed is available and initialized. If not, they return what's expected for single-process training.

  2. Update the internal callsites to use these implementations.

  3. Mark the existing functions as deprecated and slated for removal in v1.6.
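To make the "thin wrapper" idea concrete, here is a minimal sketch of what such utilities could look like. The function names and fallback behavior shown are assumptions for illustration, not Lightning's actual API; the key point is the availability/initialization guard with a sensible single-process fallback.

```python
# Hypothetical sketch of thin collective wrappers (names are assumed, not Lightning's API).
import torch.distributed as dist


def _distributed_ready() -> bool:
    """True only when torch.distributed is both available and initialized."""
    return dist.is_available() and dist.is_initialized()


def barrier() -> None:
    """Synchronize all processes; a no-op in single-process training."""
    if _distributed_ready():
        dist.barrier()


def broadcast_object(obj, src: int = 0):
    """Broadcast a picklable object from `src` to all ranks."""
    if not _distributed_ready():
        # Single-process training: the object is already "broadcast".
        return obj
    container = [obj]
    dist.broadcast_object_list(container, src=src)
    return container[0]


def all_gather_object(obj):
    """Gather a picklable object from every rank into a list."""
    if not _distributed_ready():
        # Single-process training: a one-rank gather is just [obj].
        return [obj]
    out = [None] * dist.get_world_size()
    dist.all_gather_object(out, obj)
    return out
```

With this shape, LightningModule code can call the wrappers unconditionally and get the expected single-process result when no process group has been initialized.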

cc @Borda @awaelchli @rohitgr7 @akihironitta @justusschock
