Conversation

@awaelchli
Contributor

@awaelchli awaelchli commented May 3, 2021

What does this PR do?

Fixes the CI failure blocking PR merges on master by pinning the DeepSpeed version.
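
A minimal sketch (not this PR's actual diff) of how a test suite could gate on the last DeepSpeed version known to work here; the 0.3.14 cap and the helper name are assumptions for illustration:

# Hypothetical sketch: gate tests on a known-good DeepSpeed version.
# The 0.3.14 cap and the helper name are illustrative, not the PR's actual change.
from importlib.metadata import PackageNotFoundError, version
from packaging.version import Version

_DEEPSPEED_MAX = Version("0.3.14")  # assumed last version the DeepSpeed tests passed against

def deepspeed_version_ok() -> bool:
    """Return True if the installed deepspeed is at or below the pinned version."""
    try:
        return Version(version("deepspeed")) <= _DEEPSPEED_MAX
    except PackageNotFoundError:
        return False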

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@pep8speaks

Hello @awaelchli! Thanks for opening this PR.

Line 206:36: W292 no newline at end of file

Do see the Hitchhiker's guide to code style

@awaelchli awaelchli added the ci (Continuous Integration) and priority: 0 (High priority task) labels May 3, 2021
@codecov

codecov bot commented May 3, 2021

Codecov Report

Merging #7326 (e857eeb) into master (e0c64f0) will increase coverage by 3%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #7326    +/-   ##
=======================================
+ Coverage      87%     90%    +3%     
=======================================
  Files         200     200            
  Lines       12865   13051   +186     
=======================================
+ Hits        11210   11739   +529     
+ Misses       1655    1312   -343     

@awaelchli awaelchli force-pushed the ci/deepspeed-version branch from e0dd14b to 746928f on May 3, 2021 10:17
@leezu
Contributor

leezu commented May 3, 2021

@awaelchli is there a tracking issue or other resource detailing the problems with deepspeed>0.3.14?

@awaelchli
Contributor Author

Yes, a test started failing after DeepSpeed released a new version (see the traceback below and the short sketch after it):
https://dev.azure.com/PytorchLightning/pytorch-lightning/_build/results?buildId=7176&view=logs&j=3afc50db-e620-5b81-6016-870a6976ad29&t=d9f671c5-a304-5675-5394-961fd7f98b9b

_______________________ test_deepspeed_multigpu_stage_3 ________________________

tmpdir = local('/tmp/pytest-of-AzDevOps_azpcontainer/pytest-7/test_deepspeed_multigpu_stage_0')
deepspeed_config = {'optimizer': {'params': {'lr': 3e-05}, 'type': 'SGD'}, 'scheduler': {'params': {'last_batch_iteration': -1, 'warmup_max_lr': 3e-05, 'warmup_min_lr': 0, 'warmup_num_steps': 100}, 'type': 'WarmupLR'}}

    @RunIf(min_gpus=2, deepspeed=True, special=True)
    def test_deepspeed_multigpu_stage_3(tmpdir, deepspeed_config):
        """
        Test to ensure ZeRO Stage 3 works with a parallel model.
        """
        model = ModelParallelBoringModel()
        trainer = Trainer(
            plugins=[DeepSpeedPlugin(stage=3)],
            default_root_dir=tmpdir,
            gpus=2,
            fast_dev_run=True,
            precision=16,
        )
>       trainer.fit(model)

tests/plugins/test_deepspeed_plugin.py:459: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pytorch_lightning/trainer/trainer.py:869: in fit
    self._run(model)
pytorch_lightning/trainer/trainer.py:473: in _run
    self.pre_dispatch()
pytorch_lightning/trainer/trainer.py:498: in pre_dispatch
    self.accelerator.pre_dispatch(self)
pytorch_lightning/accelerators/accelerator.py:108: in pre_dispatch
    self.training_type_plugin.pre_dispatch()
pytorch_lightning/plugins/training_type/deepspeed.py:241: in pre_dispatch
    self.init_deepspeed()
pytorch_lightning/plugins/training_type/deepspeed.py:258: in init_deepspeed
    self._initialize_deepspeed_train(model)
pytorch_lightning/plugins/training_type/deepspeed.py:287: in _initialize_deepspeed_train
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py:120: in initialize
    engine = DeepSpeedEngine(args=args,
/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py:149: in __init__
    self._configure_distributed_model(model)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = DeepSpeedEngine(
  (module): LightningDeepSpeedModule(
    (module): ModelParallelBoringModel(
      (layer): Linear(in_features=32, out_features=2, bias=True)
      (linear): Linear(in_features=32, out_features=2, bias=True)
    )
  )
)
model = LightningDeepSpeedModule(
  (module): ModelParallelBoringModel(
    (layer): Linear(in_features=32, out_features=2, bias=True)
    (linear): Linear(in_features=32, out_features=2, bias=True)
  )
)

    def _configure_distributed_model(self, model):
        self.module = model
        if self.fp16_enabled():
            if self.zero_optimization_partition_weights() and any(
                [hasattr(param,
                         'ds_id') for param in self.module.parameters()]):
>               assert all([param.dtype == torch.half for param in self.module.parameters()]), f"Model must initialized in fp16 mode for ZeRO Stage 3."
E               AssertionError: Model must initialized in fp16 mode for ZeRO Stage 3.

/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py:569: AssertionError
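
For reference, newer DeepSpeed asserts that with fp16 and ZeRO Stage 3 every module parameter is already half precision before deepspeed.initialize() is called. A minimal standalone sketch of that invariant (the toy model below is illustrative, not the plugin's actual fix):

import torch
import torch.nn as nn

# Toy module standing in for the Lightning-wrapped model in the traceback above.
model = nn.Sequential(nn.Linear(32, 2), nn.Linear(32, 2))

model.half()  # cast parameters to fp16 up front; skipping this trips the assertion above
assert all(p.dtype == torch.half for p in model.parameters())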

@awaelchli awaelchli marked this pull request as ready for review May 3, 2021 14:07
@awaelchli awaelchli requested review from Borda and tchaton as code owners May 3, 2021 14:07
@awaelchli awaelchli added this to the v1.3 milestone May 3, 2021
@awaelchli awaelchli enabled auto-merge (squash) May 3, 2021 17:25
@awaelchli awaelchli merged commit 7636d42 into master May 3, 2021
@awaelchli awaelchli deleted the ci/deepspeed-version branch May 3, 2021 18:21
kaushikb11 pushed a commit to kaushikb11/pytorch-lightning that referenced this pull request May 4, 2021

Labels

ci (Continuous Integration), priority: 0 (High priority task)

Projects

None yet

8 participants