
Conversation

@SeanNaren
Contributor

@SeanNaren SeanNaren commented Oct 15, 2020

What does this PR do?

Closes #817. There are lots of related comments in the issue, but overall fairscale has done a great job of taking the initial DeepSpeed code and turning it into a PyTorch module that supports the ZeRO optimization feature (the main feature from DeepSpeed, aside from some custom kernel ops + fp16). Thus I think for a V1 we should offer integration with fairscale, assist in getting the DDP changes for model partitioning (model parallel), and await optimisations.

Will require additional tests before merging.

I also note that the fairscale pip install crashes on a remote Ubuntu machine; installing from source, however, runs fine.

cc @blefaudeux and @ananthsub, who mentioned an integration already existing internally with Lightning, so this PR may unify efforts or be unnecessary. If I could get a look over this PR, I would really appreciate it as well :)

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@SeanNaren SeanNaren added the feature Is an improvement or enhancement label Oct 15, 2020
@SeanNaren SeanNaren self-assigned this Oct 15, 2020
@SeanNaren SeanNaren requested review from a team and Borda and removed request for Borda October 15, 2020 15:44
@pep8speaks

pep8speaks commented Oct 15, 2020

Hello @SeanNaren! Thanks for updating this PR.

Line 281:86: W292 no newline at end of file

Comment last updated at 2020-11-24 21:16:57 UTC

@SeanNaren
Contributor Author

SeanNaren commented Oct 16, 2020

Going to put some stats here. I've been tracking CUDA memory using a callback inspired by this:

import torch
from pytorch_lightning import Callback


class CUDACallback(Callback):
    def on_train_epoch_start(self, trainer, pl_module):
        # Reset the memory use counter
        torch.cuda.reset_peak_memory_stats(trainer.global_rank)
        torch.cuda.synchronize(trainer.global_rank)

    def on_train_epoch_end(self, trainer, pl_module, outputs):
        torch.cuda.synchronize(trainer.global_rank)
        # Report the peak allocation for this rank, converted from bytes to MiB
        max_memory = torch.cuda.max_memory_allocated(trainer.global_rank) / 2 ** 20

        print(f"[{trainer.global_rank}] : Peak memory {max_memory:.1f}MiB")

Running this transformers script for token classification, I've got some average peak memory stats after the first epoch:

Average peak memory allocated after 1 epoch on p3.8xlarge and p3.16xlarge instance types.

4GPUs DDP: 6840.1MiB
8GPUs DDP: 6840.1MiB
4GPUs FairScale OSS: 5263.2MiB (23% memory improvement compared to DDP)
8GPUs FairScale OSS: 4899.03MiB (28.38% memory improvement compared to DDP)

@blefaudeux

> Going to put some stats here. I've been tracking CUDA memory using a callback inspired by this: [callback code and details quoted above]
>
> 4GPUs DDP: 6840.1MiB
> 8GPUs DDP: 6840.1MiB
> 4GPUs FairScale OSS: 5263.2MiB (23% memory improvement compared to DDP)
> 8GPUs FairScale OSS: 4899.03MiB (28.38% memory improvement compared to DDP)

Nice! FYI there are more savings on the way. It would require wrapping the model and not using DDP (an autograd hook does the job), but that solves the gradient-reduce issue and reduces the communication volume, which is important when multiple nodes are involved. It does not deprecate the version you tested, it just comes on top: if OSS is used with DDP you get what you measured; if OSS is used with a model wrap instead, you get better gradient sharding.

@SeanNaren
Contributor Author

Thanks @blefaudeux! Was excited to get numbers for this, really awesome work overall :)

Are the additional savings only tied to fairscale? Is there a place I can have a look at the code? It would be good to get a head start on figuring out API changes to support this.

@blefaudeux

> Thanks @blefaudeux! Was excited to get numbers for this, really awesome work overall :)
>
> Are the additional savings only tied to fairscale? Is there a place I can have a look at the code? It would be good to get a head start on figuring out API changes to support this.

No PR yet, the work is in https://github.com/facebookresearch/fairscale/tree/oss_autograd
The interface could be:

model = myAwesomeModel()
optimizer = OSS(*optimizer_params)
model = ShardedDDP(model, optimizer, *some_basic_params)

and then a normal training loop with optimizer and model. Gradients are automatically reduced to the right rank and the optimizer state and gradients are sharded, which shaves some more memory from what you have right now.

What do you think?
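
For illustration, a minimal sketch of what a plain training loop on top of this interface might look like. The import paths, the OSS signature, and the model/dataloader names are assumptions (loosely based on later fairscale releases), not the exact API of the oss_autograd branch:

import torch
import torch.nn.functional as F
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP

model = MyAwesomeModel()  # hypothetical user-defined nn.Module, already on the right device
optimizer = OSS(model.parameters(), optim=torch.optim.SGD, lr=1e-3)  # optimizer state sharded across ranks
model = ShardedDDP(model, optimizer)

for batch, target in train_loader:  # hypothetical dataloader
    optimizer.zero_grad()
    loss = F.cross_entropy(model(batch), target)
    loss.backward()   # each gradient is reduced to the rank that owns it, instead of all-reduced
    optimizer.step()  # each rank updates its shard of the optimizer state, then parameters are synced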

@SeanNaren
Contributor Author

Yep, the interface makes sense to me. I also see the model dispatch making its way into this branch, which is super cool!

I don't think it makes sense to support both DDP + OSS and Sharded DDP + OSS, since we'll need to install fairscale regardless and Sharded DDP seems like the successor. I think it makes sense to go forward with ShardedDDP and have it replace the current implementation. Thoughts?

@blefaudeux

blefaudeux commented Oct 16, 2020

> Yep, the interface makes sense to me. I also see the model dispatch making its way into this branch, which is super cool!
>
> I don't think it makes sense to support both DDP + OSS and Sharded DDP + OSS, since we'll need to install fairscale regardless and Sharded DDP seems like the successor. I think it makes sense to go forward with ShardedDDP and have it replace the current implementation. Thoughts?

It may require some benchmarking. Basically, one feature from DDP which is hard to replicate is the overlap between BW and reduce: in DDP the gradients are all-reduced step by step while walking back the graph, concurrently with the BW computations. Currently what's "easy" to do is FW, then BW, then reduce (not all-reduce), but the overlap is lost. When we shard the model (next steps, a little more involved) it's not that much of an issue, because we can overlap the reduce of the lower shard with the BW of the upper one.

From what I can see, currently (state+gradient sharding):

  • on a single node OSS + DDP is a little faster than OSS+ad-hoc-reduce but consumes more memory
  • when multiple nodes are involved, then OSS+ad-hoc-reduce wins it all (faster and less memory)

cc @msbaines and @mrshenli in case you're interested
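
To make the overlap point concrete, here is a generic sketch (not fairscale or DDP internals) of how an autograd hook can start reducing a gradient to its owning rank as soon as it is ready, while the backward pass continues on earlier layers. The `owner_of` mapping is hypothetical:

import torch.distributed as dist

def attach_reduce_hooks(model, owner_of):
    # `owner_of(param)` is assumed to return the rank that owns this parameter's optimizer state.
    handles = []

    def make_hook(param):
        def hook(grad):
            # Fires as soon as this parameter's gradient is computed, so the
            # communication overlaps with the rest of the backward pass.
            handles.append(dist.reduce(grad, dst=owner_of(param), async_op=True))
            return grad
        return hook

    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(make_hook(p))
    return handles  # wait on these handles before optimizer.step()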

@SeanNaren
Contributor Author

Yeah, I'm still not sure it's worth supporting the DDP version. Keeping a DDP Lightning wrapper just for a small speed benefit on single-node setups, where the total GPU count is small (and personally I think reducing memory allocation matters more), doesn't seem necessary and adds confusion, unless I'm mistaken!

@SeanNaren SeanNaren changed the title Introduce fairscale accelerator [WIP] Introduce fairscale accelerator Oct 18, 2020
@SeanNaren
Contributor Author

SeanNaren commented Oct 18, 2020

Pushing WIP changes integrating ShardedDDP using the oss_autograd fairscale branch, in a similar fashion to vanilla DDP. Unfortunately I had to do quite a bit of overriding, since the model requires multiple arguments (not just the input to the model, but multiple inputs/targets for the forward/loss calculation).

Running into an issue using multiple GPUs where training hangs; currently investigating this. Also seeing if there is a nicer way to handle requiring grads for the input. The main issue is that I've made the assumption (for now) that the inputs are tuples within the batch, i.e. within ModelDispatch:

# All inputs need requires_grad=True for autograd to properly track the first dispatch layer
# Will currently break if the batch is a dict or similar; may require a recursive check
if isinstance(inputs, tuple):
    for i in inputs:
        i.requires_grad = True
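
For reference, a hypothetical recursive variant of the check above (not part of this PR, just a sketch of how nested containers could be handled):

import torch

def require_grad_on_inputs(inputs):
    # Recursively flag floating point tensors so autograd tracks the first dispatch layer.
    if isinstance(inputs, torch.Tensor):
        if inputs.is_floating_point():
            inputs.requires_grad_(True)
    elif isinstance(inputs, (tuple, list)):
        for item in inputs:
            require_grad_on_inputs(item)
    elif isinstance(inputs, dict):
        for item in inputs.values():
            require_grad_on_inputs(item)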

@blefaudeux after investigation I've noticed that it only hangs when using torch automatic mixed precision (autocast) with OSS + SDP. Any reason off the top of your head why this could happen? Will continue to investigate, but curious if you have any suggestions!

@SeanNaren
Contributor Author

SeanNaren commented Oct 25, 2020

@williamFalcon I've been trying to keep up with the plugin API, which is neat! I still think the fairscale integration should live as a native accelerator, because there are more features to come from fairscale, but I was curious about your thoughts.

EDIT: offline conclusion is living as an accelerator :)
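
For a sense of how that decision could surface to users, a hypothetical sketch (the accelerator name here is a placeholder, not a settled flag):

from pytorch_lightning import Trainer

# "ddp_sharded" is a placeholder name, not the final user-facing flag.
trainer = Trainer(gpus=8, accelerator="ddp_sharded")
trainer.fit(model)  # `model` is a regular LightningModule; no model code changes needed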

@SeanNaren
Contributor Author

A few updates to track here: running into issues supporting kwarg inputs for the forward/backward pass. It seems a solution was discussed here: facebookresearch/fairscale#160 (comment) and here: facebookresearch/fairscale#160 (comment)

There also seem to be some performance issues when moving to multi-node, which Ben has reported (facebookresearch/fairscale#157 (comment)) but which I haven't been able to test.

Since I don't think it's worth it longer term to just get the OSS optimizer into Lightning without SDP, we'll put this on hold until the performance issues and function args are solved!

@ananthsub
Contributor

> @williamFalcon I've been trying to keep up with the plugin API, which is neat! I still think the fairscale integration should live as a native accelerator, because there are more features to come from fairscale, but I was curious about your thoughts.

Naming nit: can we name this the sharded DDP accelerator? fairscale as a library will eventually have more components we may want to plug in elsewhere.

@SeanNaren
Contributor Author

@ananthsub I was thinking about that... even ZeRO or DeepSpeed might be better names, but I don't mind just calling it ShardedDDP.

@SeanNaren SeanNaren changed the title [WIP] Introduce fairscale accelerator [WIP] Introduce Sharded Accelerator Nov 1, 2020
SeanNaren and others added 8 commits November 18, 2020 11:00
@Borda Borda marked this pull request as ready for review November 30, 2020 17:57
@Borda Borda changed the title [WIP] Introduce Sharded Plugin Sharded Plugin Nov 30, 2020
@edenlightning edenlightning removed this from the 1.1 milestone Dec 8, 2020
@mergify
Contributor

mergify bot commented Dec 12, 2020

This pull request is now in conflict... :(

@SeanNaren SeanNaren closed this Dec 12, 2020
@SeanNaren SeanNaren deleted the feature/817-fairscale branch December 12, 2020 17:25