[bugfix] Prevent a DDP failure using copy #9239

tchaton · 2021-08-31T19:31:55Z

What does this PR do?

This error is critical because it blocks DDP training. Observations:

It requires num_workers > 0.
It occurs only with ddp (ddp_spawn works).
Returning a loss from training_step matters: the failure needs a loss to be returned.
on_epoch=True breaks with any reduce_fx.
on_step=True with sync_dist=True breaks when reduce_fx != "mean".

The failure can be reproduced with this minimal repro: #8821 (comment)

We believe this is an upstream bug in PyTorch, but we haven't been able to reproduce it easily without Lightning.

A test hasn't been added because pytest seems to hang on teardown due to the num_workers>0 requirement. Not sure why.

The following test would fail on master:

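# Imports are assumed from the Lightning test suite layout of that period (tests/helpers).
import torch
from torch.utils.data import DataLoader

from pytorch_lightning import Trainer
from tests.helpers import BoringModel, RandomDataset
from tests.helpers.runif import RunIf


@RunIf(min_gpus=1)
def test_ddp_requires_a_deepcopy_on_training_step_output(tmpdir):

    class TestModel(BoringModel):

        def training_step(self, batch, batch_idx):
            loss = self(batch).sum()
            # `on_epoch=True` fails with any `reduce_fx`
            self.log('foo', torch.tensor(1), on_epoch=True)
            # `on_step=True, sync_dist=True` fails when `reduce_fx != "mean"`
            self.log('bar', torch.tensor(1), on_step=True, reduce_fx="sum", sync_dist=True)
            # a loss needs to be returned!
            return loss

        def train_dataloader(self):
            # only fails with `num_workers>0`
            return DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=1)

    model = TestModel()
    trainer = Trainer(
        default_root_dir=tmpdir,
        gpus=1,
        accelerator='ddp',
        limit_train_batches=1,
        max_epochs=5,
        checkpoint_callback=False,
        logger=False,
    )
    trainer.fit(model)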

Does your PR introduce any breaking changes? If yes, please list them.

None

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
[n/a] Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you list all the breaking changes introduced by this pull request?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

codecov · 2021-08-31T20:55:39Z

Codecov Report

Merging #9239 (c43190a) into master (3e71046) will increase coverage by 4%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #9239    +/-   ##
=======================================
+ Coverage      88%     92%    +4%     
=======================================
  Files         176     176            
  Lines       14807   14810     +3     
=======================================
+ Hits        13043   13663   +620     
+ Misses       1764    1147   -617
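
As a quick sanity check, the percentages in the report can be recomputed from the Hits and Lines rows. This is only a sketch of the arithmetic; the helper name `coverage_percent` is made up for illustration and is not part of Codecov.

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Return line coverage as a percentage."""
    return 100.0 * hits / lines


# Values copied from the Codecov table above.
master_hits, master_misses, master_lines = 13043, 1764, 14807
pr_hits, pr_misses, pr_lines = 13663, 1147, 14810

master = coverage_percent(master_hits, master_lines)
pr = coverage_percent(pr_hits, pr_lines)
delta = pr - master

print(f"master: {master:.2f}%  PR: {pr:.2f}%  delta: {delta:+.2f} points")
```

Hits plus Misses equals Lines in both columns, so the table is internally consistent, and the rounded values match the reported 88%, 92%, and +4%.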

awaelchli · 2021-08-31T22:00:47Z

pytorch_lightning/trainer/connectors/logger_connector/result.py

+        if not enable_graph:
+
+            def detach_fn(tensor: Tensor) -> Tensor:
+                return tensor.detach()


I tried adding a clone() here, but it doesn't solve the issue. I'm just perplexed as to why the deepcopy is necessary. Do you have any intuition about what is going on?
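
For reference, here is a minimal sketch of how detach(), clone(), and copy.deepcopy differ for a plain tensor. It only illustrates the storage-sharing semantics of the three calls and does not claim to explain the DDP failure discussed here; the tensor `x` is a made-up example.

```python
import copy

import torch

# Hypothetical leaf tensor standing in for a logged value.
x = torch.ones(3, requires_grad=True)

detached = x.detach()        # new tensor object, same storage, no autograd history
cloned = x.detach().clone()  # new storage, no autograd history
copied = copy.deepcopy(x)    # new storage; a leaf tensor keeps requires_grad

# Compare the address of each result's data with the original's.
shares_storage = {
    "detach": detached.data_ptr() == x.data_ptr(),
    "clone": cloned.data_ptr() == x.data_ptr(),
    "deepcopy": copied.data_ptr() == x.data_ptr(),
}
print(shares_storage)
print(detached.requires_grad, cloned.requires_grad, copied.requires_grad)
```

Note that copy.deepcopy only works on leaf tensors like the one above; PyTorch raises an error when asked to deep-copy a tensor that is part of an autograd graph.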

tchaton · 2021-09-07T14:54:20Z

Hey @awaelchli,

Mind sharing more details on what you are trying to do?

Best,
T.C

resolve critical bug

d224d11

tchaton requested review from awaelchli, carmocca and justusschock as code owners August 31, 2021 19:31

carmocca added this to the v1.4.x milestone Aug 31, 2021

carmocca added the "distributed" (Generic distributed-related topic) and "priority: 0" (High priority task) labels Aug 31, 2021

update changelog

4c47a93

tchaton requested review from Borda, SeanNaren, kaushikb11 and williamFalcon as code owners August 31, 2021 19:35

tchaton and others added 5 commits August 31, 2021 15:35

add back test

2702e90

[pre-commit.ci] auto fixes from pre-commit.com hooks

899c2e0

for more information, see https://pre-commit.ci

typo

d93c38b

Merge branch 'crazy_bug_fix' of https://github.com/PyTorchLightning/pytorch-lightning into crazy_bug_fix

d2f45bc

[pre-commit.ci] auto fixes from pre-commit.com hooks

7530cd9

for more information, see https://pre-commit.ci

tchaton enabled auto-merge (squash) August 31, 2021 19:39

Fixes

e565fd9

carmocca approved these changes Aug 31, 2021

kaushikb11 approved these changes Aug 31, 2021

mergify bot added the "ready" (PRs ready to be merged) label Aug 31, 2021

ethanwharris approved these changes Aug 31, 2021

tchaton and others added 6 commits August 31, 2021 16:22

update

6beb865

Merge branch 'crazy_bug_fix' of https://github.com/PyTorchLightning/pytorch-lightning into crazy_bug_fix

b41d926

update

ddf633d

[pre-commit.ci] auto fixes from pre-commit.com hooks

3bbf974

for more information, see https://pre-commit.ci

update

0f8d6ea

[pre-commit.ci] auto fixes from pre-commit.com hooks

829eb65

for more information, see https://pre-commit.ci

carmocca reviewed Aug 31, 2021

pytorch_lightning/trainer/connectors/logger_connector/result.py (outdated, resolved)

tchaton added 2 commits August 31, 2021 16:35

cleanup

b2b436f

merge

c43190a

SeanNaren approved these changes Aug 31, 2021

tchaton merged commit ff7305f into master Aug 31, 2021

tchaton deleted the crazy_bug_fix branch August 31, 2021 21:02

ethanwharris pushed a commit that referenced this pull request Aug 31, 2021

[bugfix] Prevent a DDP failure using copy (#9239)

ac835fb

awaelchli added the "bug" (Something isn't working) label Aug 31, 2021

awaelchli reviewed Aug 31, 2021

lexierule pushed a commit that referenced this pull request Sep 1, 2021

[bugfix] Prevent a DDP failure using copy (#9239)

7312d2f

leezu mentioned this pull request Sep 7, 2021

Weekly Patch Release v1.4.6 [full merge, no squash] #9358

Merged

12 tasks

leezu mentioned this pull request Sep 30, 2021

Share the training step output data via ClosureResult #9349

Merged

12 tasks

leezu added a commit to leezu/pytorch-lightning that referenced this pull request Sep 30, 2021

Revert "[bugfix] Prevent a DDP failure using copy (Lightning-AI#9239)"

107c9c2

This reverts commit ff7305f.

6 participants
