
Conversation

@sebastienwood

@sebastienwood sebastienwood commented Aug 6, 2020

What does this PR do?

Fixes #2075 (issue)
Fixes #1716

New behavior of auto_select_gpu

  • the old behavior runs first, i.e. checking whether the GPUs can be used at all
  • the new behavior kicks in once the model and dataset are defined
  • it may pick fewer GPUs than requested if one of them is not available for the workload (see the sketch after this list)
  • constants are updated afterwards (root_gpu, on_gpu, trigger set_nvidia_flags)
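
A minimal sketch of the idea (hypothetical helper, not the code in this PR): probe each visible GPU with the model and one batch sampled from the dataloader, and keep only the devices where a training step succeeds.

import warnings

import torch


def pick_capable_gpus(requested: int, model, dataloader):
    # Hypothetical sketch: keep only the GPUs that can hold the sampled workload.
    batch = next(iter(dataloader))  # sample one batch to represent the workload
    capable = []
    for device_id in range(torch.cuda.device_count()):
        if len(capable) == requested:
            break
        device = torch.device("cuda", device_id)
        try:
            model.to(device)
            # real code would need a proper batch-transfer hook for tuples/dicts
            probe = batch.to(device) if hasattr(batch, "to") else batch
            model.training_step(probe, 0)  # approximate the actual memory footprint
            capable.append(device_id)
        except RuntimeError:  # typically a CUDA out-of-memory error
            pass
        finally:
            model.cpu()
            torch.cuda.empty_cache()
    if len(capable) < requested:
        warnings.warn(f"Only {len(capable)} of the {requested} requested GPUs fit the workload")
    return capable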

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

Notes:

  • the test does not feel satisfying; I'm looking for ideas to improve it
  • the documentation already seems to convey the effect of this new auto_select_gpu behavior, hence no update on that side
  • it relies on the dataloader being defined to work properly, since a batch is sampled from it to represent the actual workload
  • I have doubts about whether the method runs too "late" with respect to the environment configuration for the training process
  • this is my first PR on an OSS project with tests and such, any comment is welcome!

@pep8speaks

pep8speaks commented Aug 6, 2020

Hello @sebastienwood! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-07 18:01:18 UTC

@mergify mergify bot requested a review from a team August 6, 2020 16:30
@Borda Borda added the feature Is an improvement or enhancement label Aug 6, 2020
@Borda Borda added this to the 0.9.0 milestone Aug 6, 2020
Member

@justusschock justusschock left a comment


The current implementation won't work due to the specified reasons. Another question: Is this called before DP/DDP initialization? Otherwise it won't work with those

@mergify mergify bot requested a review from a team August 7, 2020 06:33
@SpontaneousDuck
Contributor

I'm a little hesitant about it automatically selecting fewer GPUs than requested if not enough are usable, even if it warns the user. It might be useful to have a flag that lets the user choose whether this throws an error or automatically reduces the number of GPUs.

@justusschock
Member

@SpontaneousDuck actually I strongly disagree here. We already have so many flags that I don't want to add another one. But I agree that the correct behaviour should be to raise an error

@sebastienwood
Author

The current implementation won't work due to the specified reasons. Another question: Is this called before DP/DDP initialization? Otherwise it won't work with those

It is called at line 975 in trainer.py, before the whole if/elif over accelerator backends which sets up DP/DDP.
Would model_device.training_step(batch_device, 0) be ok for line 487? It should delegate batch handling to the user.

@SpontaneousDuck
Contributor

SpontaneousDuck commented Aug 7, 2020

@justusschock Sounds great to me! I was just a little worried about reproducibility, since a variable number of GPUs can affect it. Comparisons between training runs can also be skewed if the number of GPUs used varies. Just throwing an error instead would solve that. I do agree there are already a ton of flags!

@mergify
Contributor

mergify bot commented Aug 7, 2020

This pull request is now in conflict... :(

Collaborator

@SkafteNicki SkafteNicki left a comment


PR is looking good. Remember to add this behavior to the documentation (that Lightning will actively look for GPUs which have enough free memory).

Question: is it possible to add a test where we have two gpus, still request 1 with auto_select_gpus=True and then we artificially fill up gpu 0 and assert that we actually pick gpu 1?
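
A rough sketch of such a test (assumptions: at least 2 GPUs, torch.cuda.mem_get_info is available, and the picked ids end up in trainer.data_parallel_device_ids as in the diff below; filling GPU 0 reliably is the tricky part):

import pytest
import torch

from pytorch_lightning import Trainer


@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="test requires at least 2 GPUs")
def test_auto_select_skips_full_gpu():
    # occupy most of GPU 0 so the sampled workload no longer fits there
    free_bytes, _ = torch.cuda.mem_get_info(0)
    filler = torch.empty(int(free_bytes * 0.9), dtype=torch.uint8, device="cuda:0")

    trainer = Trainer(gpus=1, auto_select_gpus=True)
    assert trainer.data_parallel_device_ids == [1]

    del filler
    torch.cuda.empty_cache()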

# Called when model/data is known. Ensure the GPUs used have enough VRAM.
self.gpus = pick_multiple_gpus(len(self.gpus), model) # At most the current number of GPUs

self.data_parallel_device_ids = _parse_gpu_ids(self.gpus)
Collaborator


Could this function be used during __init__ to avoid duplicating code?

self.setup(stage_name)
model.setup(stage_name)

def update_auto_selected_gpus(self, model):
Collaborator


Suggested change
def update_auto_selected_gpus(self, model):
def update_auto_selected_gpus(self, model: LightningModule):

@mergify mergify bot requested a review from a team August 13, 2020 12:42
@justusschock
Member

@sebastienwood While this PR seems to be good, I have another question:

Let's say I have a system with two GPUs, which are both empty at the beginning and can both hold 10GB.

My model (including inputs, activations and gradients) takes only 4GB. So when I now tell Lightning to search for a free GPU, it will use the first one (which is great). But if I want to run the same experiment in parallel with another config, it would probably select the same GPU again, since there is still memory left.

However, if my model is so computationally expensive that the GPU utilisation is always high even though I don't need much memory, it will clearly slow everything down. Can't we use free GPUs if available, or maybe go for utilisation first and then memory as a second-order criterion?
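
One way to express that priority (a pynvml sketch, not something in this PR or in Lightning): rank GPUs by utilisation first and use free memory only as a tie-breaker.

import pynvml  # NVIDIA management library bindings


def rank_gpus_by_utilisation():
    # prefer idle GPUs; fall back to memory headroom as a second-order criterion
    pynvml.nvmlInit()
    try:
        scores = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            utilisation = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu  # percent busy
            memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
            scores.append((utilisation, -memory.free, index))
        return [index for _, _, index in sorted(scores)]
    finally:
        pynvml.nvmlShutdown()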

@williamFalcon
Contributor

yeah, I agree with @justusschock. We don't want to put multiple models on the same GPU. For our purposes we should consider any GPU that is being utilized as "full" (even if the memory usage is only 5%, for example)

@justusschock
Member

@williamFalcon actually, we shouldn't be that strict. We should search for empty GPUs, but if no empty GPU is there, we should use the ones that actually fit from a memory perspective, since on workstations you often also have the X server allocating some GPU memory (typically 100-200 MB)

@williamFalcon
Contributor

yeah, but the problem is that if you are sharing this with someone you’ll put your work on their gpu and cut their speed in half

@justusschock
Member

That's why I wanted to look at GPU-Utilization instead :)

@awaelchli
Contributor

awaelchli commented Aug 14, 2020

Here is a compromise/proposition: implement the strict version in such a way that it can easily be overridden by anyone who wants different behavior or has special requirements on the type of GPU etc.
Instead of a function, make it a class that implements the algorithm. Break it down into its fundamental blocks. Let the user override parts of it or the whole algorithm. Then:

trainer = Trainer(gpus="auto") # default algorithm
trainer = Trainer(gpus=SophisticatedGPUChooser())  # custom algorithm
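
A sketch of what such a class could look like (the names and the mem_get_info-based emptiness check are assumptions, not an existing Lightning API):

from typing import List

import torch


class GPUChooser:
    # hypothetical base class: override individual blocks or the whole algorithm

    def candidates(self) -> List[int]:
        return list(range(torch.cuda.device_count()))

    def is_available(self, device_id: int) -> bool:
        # strict default: require (almost) all memory to be free; the driver
        # itself reserves a little, so an exact emptiness check would never pass
        free, total = torch.cuda.mem_get_info(device_id)
        return free / total > 0.98

    def choose(self, num_gpus: int) -> List[int]:
        chosen = [i for i in self.candidates() if self.is_available(i)][:num_gpus]
        if len(chosen) < num_gpus:
            raise RuntimeError(f"Only {len(chosen)} of {num_gpus} requested GPUs are available")
        return chosen


class SophisticatedGPUChooser(GPUChooser):
    # example override: tolerate GPUs that are already partially used

    def is_available(self, device_id: int) -> bool:
        free, total = torch.cuda.mem_get_info(device_id)
        return free / total > 0.5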

@sebastienwood
Author

sebastienwood commented Aug 14, 2020

@awaelchli this is appealing. There are a limited number of scenarios (based on the discussions):

  • fewer than the requested number of GPUs could accept the workload: throw an error
  • enough capable GPUs were found, but they are already in use: go/no-go decision
  • enough capable GPUs were found and they're free

The only block left to settle is the go/no-go decision. Is it ok to have a simple API for the decision, based on the current maximal VRAM usage ratio of the capable GPUs?

E.g. we have one GPU and we request one. Its VRAM load is 25% and it is capable. criterion_function(0.25) -> bool returns the go/no-go decision.
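
A minimal sketch of that hook (criterion_function as named above; the default threshold is an assumption):

def criterion_function(vram_usage_ratio: float, threshold: float = 0.5) -> bool:
    # go/no-go decision for an otherwise capable GPU based on its current VRAM load
    return vram_usage_ratio < threshold


# the example above: a capable GPU at 25% VRAM load gets a "go" with the default threshold
assert criterion_function(0.25) is True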

@Borda Borda modified the milestones: 1.0.0, 0.9.x Aug 20, 2020
@edenlightning
Contributor

mind rebasing?

@edenlightning edenlightning modified the milestones: 0.9.x, 1.0, 1.1 Oct 4, 2020
@edenlightning
Contributor

@williamFalcon @awaelchli what do you think of the latest suggestion?

@awaelchli
Contributor

If a GPU is in use, we should not pick it as a candidate. Even if the memory allows it, you can't really judge the GPU utilisation, and I have not had good experiences with it, with processes crashing. I think it should be one process per GPU max.
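
For reference, the "one process per GPU" rule could be checked roughly like this (a pynvml sketch, not part of this PR):

import pynvml


def gpus_without_running_processes():
    # treat a GPU as a candidate only if no compute process is attached to it
    pynvml.nvmlInit()
    try:
        candidates = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            if not pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
                candidates.append(index)
        return candidates
    finally:
        pynvml.nvmlShutdown()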

@stale

stale bot commented Nov 7, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

@stale stale bot added the won't fix This will not be worked on label Nov 7, 2020
@stale

stale bot commented Nov 12, 2020

This pull request is going to be closed. Please feel free to reopen it or create a new one from the current master.

@stale stale bot closed this Nov 12, 2020

@qsh-zh

qsh-zh commented Nov 3, 2021

@awaelchli Your suggestion is appealing. I am new to PyTorch Lightning; do we have such a feature where we can select the best available device?


Labels

  • feature: Is an improvement or enhancement
  • won't fix: This will not be worked on

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • Enhancing auto_select_GPU
  • Document auto_select_gpus