Support ddp_fork strategy with native AMP by attempting NVML-based CUDA availability assessment #14981

@speediedan

🚀 Feature

ddp_fork (and associated alias strategies) cannot currently be used along with native AMP due to the invocation of the CUDA Runtime API within the call to GradScaler in the NativeMixedPrecisionPlugin:

https://github.com/Lightning-AI/lightning/blob/c059db446e7bfea03fba91e598ad503f0d1c6581/src/pytorch_lightning/plugins/precision/native_amp.py#L53

which in turn initializes CUDA and poisons subsequent forks.

It may be possible with a future version of PyTorch to alter the default behavior of torch.cuda.is_available() to use an NVML-based CUDA assessment throughout Lightning. In the meantime, patching torch.cuda.is_available() with Lightning's implementation of the upstream NVML-based assessment can unlock this functionality.
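As a rough illustration of what an NVML-based assessment can look like, the sketch below asks the NVIDIA Management Library for a device count through `ctypes`, which avoids touching the CUDA Runtime API. It is not Lightning's or PyTorch's implementation: the function names `nvml_device_count` and `cuda_available_without_init` are hypothetical, the library name `libnvidia-ml.so.1` assumes a Linux driver install, and the sketch ignores `CUDA_VISIBLE_DEVICES`, which NVML does not honor on its own.

```python
import ctypes


def nvml_device_count() -> int:
    """Return the number of devices NVML reports, or 0 if NVML is unusable.

    Hypothetical helper for illustration; assumes the Linux soname of NVML.
    """
    try:
        nvml = ctypes.CDLL("libnvidia-ml.so.1")
        init = nvml.nvmlInit_v2
        get_count = nvml.nvmlDeviceGetCount_v2
        shutdown = nvml.nvmlShutdown
    except (OSError, AttributeError):
        # Library or symbols missing: treat as "no NVIDIA devices".
        return 0

    if init() != 0:  # 0 is NVML_SUCCESS
        return 0
    try:
        count = ctypes.c_uint(0)
        status = get_count(ctypes.byref(count))
        return count.value if status == 0 else 0
    finally:
        shutdown()


def cuda_available_without_init() -> bool:
    """Fork-safe approximation of a CUDA availability check (assumption:
    at least one NVML-visible device means CUDA is usable)."""
    return nvml_device_count() > 0


if __name__ == "__main__":
    print("NVML device count:", nvml_device_count())
```

Because no CUDA context is created in the parent process, a check along these lines can run before forking worker processes.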

I'll be opening a PR shortly that patches torch.cuda.is_available() within NativeMixedPrecisionPlugin (both the Lite and PL versions) and adds a standalone test for the ddp_fork strategy in a CUDA and AMP context. The standalone test will be added only for PL, given how expensive the standalone multi-GPU tests can be.
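The patching idea can be sketched as a context manager that swaps the availability check only while the scaler is being built. Everything here is a stand-in so the example runs without PyTorch or a GPU: `fake_cuda` plays the role of `torch.cuda`, `fork_safe_is_available` stands for an NVML-based check, and `build_scaler` stands for the GradScaler construction that consults `is_available()`. None of these names come from Lightning or from the planned PR.

```python
from contextlib import contextmanager
from types import SimpleNamespace
from unittest import mock

# Stand-in for the `torch.cuda` module (assumption, for illustration only).
fake_cuda = SimpleNamespace(is_available=lambda: "runtime-api-check")


def fork_safe_is_available() -> bool:
    """Hypothetical fork-safe check; a real one would query NVML."""
    return False


@contextmanager
def patched_is_available(cuda_module):
    """Temporarily replace `cuda_module.is_available` with the fork-safe check."""
    with mock.patch.object(cuda_module, "is_available", fork_safe_is_available):
        yield
    # mock.patch.object restores the original attribute on exit.


def build_scaler(cuda_module):
    """Stand-in for constructing a scaler that calls is_available()."""
    return {"enabled": cuda_module.is_available()}


if __name__ == "__main__":
    with patched_is_available(fake_cuda):
        print(build_scaler(fake_cuda))  # uses the fork-safe check
    print(fake_cuda.is_available())     # original check is back in place
```

Scoping the patch to the construction call keeps the rest of the program on the default behavior, which limits the blast radius if the replacement check disagrees with the runtime one.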

Motivation

Many users run AMP inside Jupyter notebooks, where ddp_fork is the strategy needed to train on multiple GPUs, so supporting the combination is important.

Pitch

Allow AMP to be used with ddp_fork, so that multi-GPU training with mixed precision works inside Jupyter notebooks.
I will open a small PR shortly that makes this available.

Additional context

There's a related PR currently open in PyTorch that may allow the requested change to torch.cuda.is_available() to apply throughout Lightning without patching the function or maintaining Lightning's own NVML-based assessment (once the relevant version of PyTorch becomes the minimum supported version).

cc @justusschock @awaelchli @carmocca

Labels: bug (Something isn't working), precision: amp (Automatic Mixed Precision), strategy: ddp (DistributedDataParallel)
