
Conversation

@sneiman (Contributor) commented Mar 18, 2020

Fix #1161 - when using ddp/ddp2, the validation and training loops run the full respective dataset on each GPU. This costs time and changes the batch counts for any statistics being collected.

The fix simply ensures that for ddp and ddp2, `auto_add_sampler()` creates a `DistributedSampler` for each dataset (see the sketch below).
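For anyone skimming, here is a minimal sketch of the idea, not the actual Lightning implementation; the helper name and the explicit `num_replicas`/`rank` arguments are illustrative (under ddp they come from the process group), and the real `auto_add_sampler()` carries over more of the original loader's settings:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def add_distributed_sampler(loader: DataLoader, num_replicas: int, rank: int) -> DataLoader:
    # Each of the num_replicas processes gets a disjoint ~1/num_replicas
    # slice of the dataset instead of iterating the full thing.
    sampler = DistributedSampler(loader.dataset, num_replicas=num_replicas, rank=rank)
    # Rebuild the loader around the distributed sampler, carrying over the
    # user's settings. Note that sampler and shuffle are mutually exclusive:
    # shuffling is handled by the sampler (call sampler.set_epoch() each
    # epoch in real training so the shuffle order changes).
    return DataLoader(
        loader.dataset,
        batch_size=loader.batch_size,
        sampler=sampler,
        num_workers=loader.num_workers,
        pin_memory=loader.pin_memory,
        drop_last=loader.drop_last,
    )
```

With this in place, a validation set of M samples on N GPUs is split into shards of roughly M/N per process, rather than every process running all M samples.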

This passes all the tests on my machine except the SLURM- and apex-related ones, as I have neither. I don't think this needs any doc changes. I can look into writing a test for this if needed; let me know.

@Borda changed the title from "Issue 1161" to "validation and training loops run the partial dataset" Mar 18, 2020
@Borda added the `docs` (Documentation related) label Mar 18, 2020
@Borda requested review from ethanwharris and neggert Mar 18, 2020 23:35
@williamFalcon (Contributor) commented

@srush mind taking a look? This came from our chats with the HF code.

@srush (Contributor) commented Mar 30, 2020

This seems good to me. We have some val sets that are quite large.

@Borda requested review from a team, jeffling and jeremyjordan Mar 30, 2020 16:11
@Borda (Collaborator) left a comment


LGTM 🚀

@Borda added the `ready` (PRs ready to be merged) label Mar 30, 2020
@williamFalcon merged commit 6dfe995 into Lightning-AI:master Mar 30, 2020
alexeykarnachev pushed a commit to alexeykarnachev/pytorch-lightning that referenced this pull request Apr 3, 2020

* auto_add_sampler() fix

* auto_add_sampler() fix

Co-authored-by: seth <[email protected]>

Labels

docs (Documentation related), ready (PRs ready to be merged)


Development

Successfully merging this pull request may close these issues.

multi-gpu ddp calls validation and testing loops too many times (#1161)

5 participants