-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Improving Hydra+DDP support #11617
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improving Hydra+DDP support #11617
Conversation
tchaton
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really neat! Mind moving the hydra-specific logic to its own utility file?
|
@tchaton I'm working through some issues with |
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
tests/tests_pytorch/strategies/launchers/test_subprocess_script.py
Outdated
Show resolved
Hide resolved
|
Finally merged! Thank you so much for your time @jgbos |
Co-authored-by: Carlos Mocholí <[email protected]> Co-authored-by: rohitgr7 <[email protected]> Co-authored-by: Jirka <[email protected]>
|
@jgbos @awaelchli @carmocca @rohitgr7
These issues was not happended with hydra-submitit-launcher (with slurm). |
|
@Liangtaiwan Not sure what is going on there. Could you report the exact issue, steps to reproduce, and environment you run in a GitHub issue? Perhaps @jgbos will be able to help too? |
|
@Liangtaiwan sorry, I would need to know more about the error. Definitely open an issue and I will try to help. Would be good to know Hydra version and how you executed the script. Also, this error should not occur if launching on slurm as each gpu task is launched via the submission script (i.e., submitit launcher) instead of this script. |
This reverts commit 45ca781.
This reverts commit 45ca781.
This reverts commit 45ca781.
What does this PR do?
There are documented issues for Hydra+DDP in #11300 and #2727. This PR attempts to fix these issues by redefining how a process is spawned when Hydra is used.
Fixes #11300
Problem
Current approach is defined here: https://github.com/PyTorchLightning/pytorch-lightning/blob/fe34bf2a653ebd50e6a3a00be829e3611f820c3c/pytorch_lightning/strategies/ddp.py#L233
This PR addresses the issue of running with Hydra
multirun. For example, lets say we have the following Hydra app:We can execute a
multirunjob with this app using the following command:This command will attempt to launch 2 jobs sequentially: one with
foo=1and one withfoo=2. For the first job,foo=1, PL launchers a normal job that begins execution while the second job is spawned with a subprocess using the following command derived fromsys.argv:This will spawn a new
mutlirunjob instead of running a normal job withfoo=2. The command should beSolution
Every Hydra process will save the reproducible configuration of the job in a
config.yamlfile located in thehydra.output_subdirexperiment directory. Using Hydra's CLI, we can execute the app with same configuration as the current experiment by defining--config_path, -cpand--config_name, -cn:Here
config.yamlcontains the value forfooappropriate for the currentmultirunjob. I've outlined the support formultirun, but this should support any Hydra application launched from the command line.Lingering Issue Not Solved
In order to run
multirunon a local machine the user must add additional code to their task function before launching the next job. This code, shown below, will destroy all distributed processes and remove PL related environment variables. Without this code themultirunjob will hang after the first job.Thoughts
This solution should help most people, but IMHO the current method of spawning using
sys.argvis not robust. It would be nice to be able to execute a PL application similar to how I implemented the Hydra solution. This would require dynamically creating a configurations for the application. The use ofsave_hyperparametersis nice but is only done inside the creation of aLightningModule, it would be nice to create a description of the model at a higher level — outside of the module. By defining the description outside of the module, one could use a similar approach as the Hydra solution above for PL applications. I would recommend taking a look at hydra-zen (FYI, I'm a contributor).Does your PR introduce any breaking changes? If yes, please list them.
None
Before submitting
Personal TODO:
PL TODO: