Conversation

@awaelchli
Contributor

@awaelchli awaelchli commented May 3, 2021

What does this PR do?

Fixes the CI failure blocking PR merges on master by pinning the DeepSpeed version.
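
A minimal sketch (not this PR's actual diff) of how a test suite could gate on the last DeepSpeed version known to work here; the 0.3.14 cap and the helper name are assumptions for illustration:

# Hypothetical sketch: gate tests on a known-good DeepSpeed version.
# The 0.3.14 cap and the helper name are illustrative, not the PR's actual change.
from importlib.metadata import PackageNotFoundError, version
from packaging.version import Version

_DEEPSPEED_MAX = Version("0.3.14")  # assumed last version the DeepSpeed tests passed against

def deepspeed_version_ok() -> bool:
    """Return True if the installed deepspeed is at or below the pinned version."""
    try:
        return Version(version("deepspeed")) <= _DEEPSPEED_MAX
    except PackageNotFoundError:
        return False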

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@pep8speaks

Hello @awaelchli! Thanks for opening this PR.

Line 206:36: W292 no newline at end of file

Do see the Hitchhiker's guide to code style

@awaelchli awaelchli added the ci (Continuous Integration) and priority: 0 (High priority task) labels May 3, 2021
@codecov

codecov bot commented May 3, 2021

Codecov Report

Merging #7326 (e857eeb) into master (e0c64f0) will increase coverage by 3%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #7326    +/-   ##
=======================================
+ Coverage      87%     90%    +3%     
=======================================
  Files         200     200            
  Lines       12865   13051   +186     
=======================================
+ Hits        11210   11739   +529     
+ Misses       1655    1312   -343     

@awaelchli awaelchli force-pushed the ci/deepspeed-version branch from e0dd14b to 746928f on May 3, 2021 10:17
@leezu
Contributor

leezu commented May 3, 2021

@awaelchli is there a tracking issue or other resource detailing the problems with deepspeed>0.3.14?

@awaelchli
Contributor Author

Yes, a test started failing after DeepSpeed released a new version (see the traceback below and the short sketch after it):
https://dev.azure.com/PytorchLightning/pytorch-lightning/_build/results?buildId=7176&view=logs&j=3afc50db-e620-5b81-6016-870a6976ad29&t=d9f671c5-a304-5675-5394-961fd7f98b9b

_______________________ test_deepspeed_multigpu_stage_3 ________________________

tmpdir = local('/tmp/pytest-of-AzDevOps_azpcontainer/pytest-7/test_deepspeed_multigpu_stage_0')
deepspeed_config = {'optimizer': {'params': {'lr': 3e-05}, 'type': 'SGD'}, 'scheduler': {'params': {'last_batch_iteration': -1, 'warmup_max_lr': 3e-05, 'warmup_min_lr': 0, 'warmup_num_steps': 100}, 'type': 'WarmupLR'}}

    @RunIf(min_gpus=2, deepspeed=True, special=True)
    def test_deepspeed_multigpu_stage_3(tmpdir, deepspeed_config):
        """
        Test to ensure ZeRO Stage 3 works with a parallel model.
        """
        model = ModelParallelBoringModel()
        trainer = Trainer(
            plugins=[DeepSpeedPlugin(stage=3)],
            default_root_dir=tmpdir,
            gpus=2,
            fast_dev_run=True,
            precision=16,
        )
>       trainer.fit(model)

tests/plugins/test_deepspeed_plugin.py:459: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pytorch_lightning/trainer/trainer.py:869: in fit
    self._run(model)
pytorch_lightning/trainer/trainer.py:473: in _run
    self.pre_dispatch()
pytorch_lightning/trainer/trainer.py:498: in pre_dispatch
    self.accelerator.pre_dispatch(self)
pytorch_lightning/accelerators/accelerator.py:108: in pre_dispatch
    self.training_type_plugin.pre_dispatch()
pytorch_lightning/plugins/training_type/deepspeed.py:241: in pre_dispatch
    self.init_deepspeed()
pytorch_lightning/plugins/training_type/deepspeed.py:258: in init_deepspeed
    self._initialize_deepspeed_train(model)
pytorch_lightning/plugins/training_type/deepspeed.py:287: in _initialize_deepspeed_train
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py:120: in initialize
    engine = DeepSpeedEngine(args=args,
/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py:149: in __init__
    self._configure_distributed_model(model)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = DeepSpeedEngine(
  (module): LightningDeepSpeedModule(
    (module): ModelParallelBoringModel(
      (layer): Linear(in_features=32, out_features=2, bias=True)
      (linear): Linear(in_features=32, out_features=2, bias=True)
    )
  )
)
model = LightningDeepSpeedModule(
  (module): ModelParallelBoringModel(
    (layer): Linear(in_features=32, out_features=2, bias=True)
    (linear): Linear(in_features=32, out_features=2, bias=True)
  )
)

    def _configure_distributed_model(self, model):
        self.module = model
        if self.fp16_enabled():
            if self.zero_optimization_partition_weights() and any(
                [hasattr(param,
                         'ds_id') for param in self.module.parameters()]):
>               assert all([param.dtype == torch.half for param in self.module.parameters()]), f"Model must initialized in fp16 mode for ZeRO Stage 3."
E               AssertionError: Model must initialized in fp16 mode for ZeRO Stage 3.

/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py:569: AssertionError
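
For reference, newer DeepSpeed asserts that with fp16 and ZeRO Stage 3 every module parameter is already half precision before deepspeed.initialize() is called. A minimal standalone sketch of that invariant (the toy model below is illustrative, not the plugin's actual fix):

import torch
import torch.nn as nn

# Toy module standing in for the Lightning-wrapped model in the traceback above.
model = nn.Sequential(nn.Linear(32, 2), nn.Linear(32, 2))

model.half()  # cast parameters to fp16 up front; skipping this trips the assertion above
assert all(p.dtype == torch.half for p in model.parameters())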

@awaelchli awaelchli marked this pull request as ready for review May 3, 2021 14:07
@awaelchli awaelchli requested review from Borda and tchaton as code owners May 3, 2021 14:07
@awaelchli awaelchli added this to the v1.3 milestone May 3, 2021
@awaelchli awaelchli enabled auto-merge (squash) May 3, 2021 17:25
@awaelchli awaelchli merged commit 7636d42 into master May 3, 2021
@awaelchli awaelchli deleted the ci/deepspeed-version branch May 3, 2021 18:21
kaushikb11 pushed a commit to kaushikb11/pytorch-lightning that referenced this pull request May 4, 2021

Labels

ci (Continuous Integration), priority: 0 (High priority task)

Projects

None yet

8 participants