Updated auto select gpu #2852
Conversation
Hello @sebastienwood! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-08-07 18:01:18 UTC
pep8 fix
justusschock
left a comment
The current implementation won't work, due to the reasons specified. Another question: is this called before DP/DDP initialization? Otherwise it won't work with those backends.
A little hesitant about it automatically selecting fewer GPUs than requested if not enough are usable, even if it warns the user. It might be useful to have a flag that lets the user choose whether this throws an error or automatically reduces the number of GPUs.
@SpontaneousDuck actually I strongly disagree here. We already have so many flags that I don't want to add another one. But I agree that the correct behaviour should be to raise an error.
It is called at line 975 in trainer.py, before the whole if/elif over accelerator backends that sets up DP/DDP.
@justusschock Sounds great to me! I was just a little worried about reproducibility, since a variable number of GPUs can affect that. Comparing different training runs also becomes harder if the number of GPUs used varies. Just throwing an error instead would solve that. I do agree there are already a ton of flags!
This pull request is now in conflict... :(
SkafteNicki
left a comment
PR is looking good. Remember to document this behavior (that Lightning will actively look for GPUs which have enough free memory).
Question: is it possible to add a test where we have two GPUs, still request one with auto_select_gpus=True, then artificially fill up GPU 0 and assert that we actually pick GPU 1?
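A rough sketch of what such a test could look like (the import path of pick_multiple_gpus and its single-argument call are assumptions for illustration; the version in this PR additionally receives the model):

```python
import pytest
import torch

# Hypothetical import path for the selection helper referenced in this PR's diff.
from pytorch_lightning.trainer.distrib_parts import pick_multiple_gpus


@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="test requires 2 GPUs")
def test_auto_select_skips_busy_gpu():
    # Artificially fill most of GPU 0 so it no longer has enough free memory.
    total = torch.cuda.get_device_properties(0).total_memory
    filler = torch.empty(int(0.8 * total) // 4, dtype=torch.float32, device="cuda:0")

    # Request a single GPU and expect the selection to skip the busy device.
    assert pick_multiple_gpus(1) == [1]

    del filler
    torch.cuda.empty_cache()
```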
    # Called when model/data is known. Ensure the GPUs used have enough VRAM.
    self.gpus = pick_multiple_gpus(len(self.gpus), model)  # At most the current number of GPUs
    self.data_parallel_device_ids = _parse_gpu_ids(self.gpus)
Could this function be used during __init__ to avoid duplicating code?
    self.setup(stage_name)
    model.setup(stage_name)

    def update_auto_selected_gpus(self, model):
Suggested change:
-    def update_auto_selected_gpus(self, model):
+    def update_auto_selected_gpus(self, model: LightningModule):
@sebastienwood While this PR seems to be good, I have another question. Let's say I have a system with two GPUs, both empty at the beginning and each holding 10GB. My model (including inputs, activations and gradients) takes only 4GB. When I now tell Lightning to search for a free GPU, it will use the first one (which is great). But if I want to run the same experiment in parallel with another config, it would probably select the same GPU again, since there is still memory left. However, if my model is computationally expensive, so that GPU utilisation is always high even though I don't need much memory, this will clearly slow everything down. Can't we use free GPUs if available, or maybe go for utilisation first and then memory as a second-order criterion?
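A possible shape for that utilisation-first policy, querying NVML through the pynvml package (the package choice and helper name are assumptions, not part of this PR):

```python
import pynvml


def pick_least_busy_gpu() -> int:
    """Pick the GPU with the lowest utilisation, breaking ties by used memory."""
    pynvml.nvmlInit()
    try:
        stats = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu  # percent
            used = pynvml.nvmlDeviceGetMemoryInfo(handle).used       # bytes
            stats.append((util, used, i))
        # Utilisation first, used memory as the second-order criterion.
        return min(stats)[2]
    finally:
        pynvml.nvmlShutdown()
```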
Yeah, I agree with @justusschock. We don't want to add multiple models to the same GPU. For our purposes we should consider any GPU that is being utilized as “full” (even if its memory usage is only 5%, for example).
@williamFalcon actually, we shouldn't be that strict. We should search for empty GPUs, but if no empty GPU is available we should use the ones that actually fit from a memory perspective, since on workstations the X server often also allocates some GPU memory (typically 100-200 MB).
Yeah, but the problem is that if you are sharing the machine with someone, you'll put your work on their GPU and cut their speed in half.
That's why I wanted to look at GPU utilisation instead :)
Here is a compromise/proposal: implement the strict version in such a way that it can easily be overridden by anyone who wants different behavior or has special requirements on the type of GPU, etc.:

trainer = Trainer(gpus="auto")                      # default algorithm
trainer = Trainer(gpus=SophisticatedGPUChooser())   # custom algorithm
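A minimal sketch of what such a pluggable chooser could look like, assuming the Trainer accepted any object exposing a pick(num_gpus) method (the protocol, the class, and the availability of torch.cuda.mem_get_info are assumptions, not an existing API):

```python
from typing import List, Protocol

import torch


class GPUChooser(Protocol):
    """Hypothetical interface a custom chooser would have to satisfy."""

    def pick(self, num_gpus: int) -> List[int]:
        ...


class SophisticatedGPUChooser:
    """Example policy: prefer the devices with the most free memory."""

    def pick(self, num_gpus: int) -> List[int]:
        # torch.cuda.mem_get_info needs a recent PyTorch; older versions would query NVML instead.
        free_mem = [(torch.cuda.mem_get_info(i)[0], i) for i in range(torch.cuda.device_count())]
        free_mem.sort(reverse=True)
        return [idx for _, idx in free_mem[:num_gpus]]
```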
@awaelchli this is appealing. There are a limited number of scenarios (based on the discussion above).
The only blocker to clear is the go/no-go decision. Is it OK to have a simple API for the decision, based on the current maximal VRAM usage ratio of the capable GPUs? E.g. we have one GPU and request one; its VRAM load is 25%, so it is capable.
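To make the go/no-go idea concrete, a toy version of such a check could look like this (the threshold value and helper name are purely illustrative):

```python
import torch


def is_capable(device_index: int, max_usage_ratio: float = 0.5) -> bool:
    """Hypothetical go/no-go decision: a GPU is capable if its VRAM usage ratio is low enough."""
    free, total = torch.cuda.mem_get_info(device_index)  # needs a recent PyTorch
    usage_ratio = 1.0 - free / total
    return usage_ratio <= max_usage_ratio  # e.g. a GPU at 25% load passes a 50% threshold
```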
Mind rebasing?
@williamFalcon @awaelchli what do you think of the latest suggestion?
If a GPU is in use, we should not pick it as a candidate. Even if the memory allows it, you can't really judge GPU utilisation, and I have not had good experiences with it: processes end up crashing. I think it should be at most one process per GPU.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.
This pull request is going to be closed. Please feel free to reopen it or create a new one from the current master.
@awaelchli Your suggestion is appealing. I am new to PyTorch Lightning; do we have such a feature to select the best available device?
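For reference, the auto_select_gpus flag discussed in this thread is enabled directly on the Trainer; a minimal sketch of its use (assuming a CUDA machine and a Lightning version of that era):

```python
from pytorch_lightning import Trainer

# Request two GPUs and let Lightning pick which physical devices back them.
trainer = Trainer(gpus=2, auto_select_gpus=True)
```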
What does this PR do?
Fixes #2075 (issue)
Fixes #1716
New behavior of auto_select_gpus: the selection now happens once the model/data is known (update root_gpu, on_gpu, trigger set_nvidia_flags).
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃
Notes:
…auto_select_gpus behavior, hence no update on that side.