
Conversation

@sebastienwood

@sebastienwood sebastienwood commented Aug 6, 2020

What does this PR do?

Fixes #2075 (issue)
Fixes #1716

New behavior of auto_select_gpu

  • the old behavior runs first, i.e. checking whether the GPUs can be used at all
  • the new behavior kicks in once the model and dataset are defined
  • it may pick fewer GPUs than requested if one of them is not available for the workload (see the sketch after this list)
  • constants are updated afterwards (root_gpu, on_gpu, trigger set_nvidia_flags)
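
A minimal sketch of the idea (hypothetical helper, not the code in this PR): probe each visible GPU with the model and one batch sampled from the dataloader, and keep only the devices where a training step succeeds.

import warnings

import torch


def pick_capable_gpus(requested: int, model, dataloader):
    # Hypothetical sketch: keep only the GPUs that can hold the sampled workload.
    batch = next(iter(dataloader))  # sample one batch to represent the workload
    capable = []
    for device_id in range(torch.cuda.device_count()):
        if len(capable) == requested:
            break
        device = torch.device("cuda", device_id)
        try:
            model.to(device)
            # real code would need a proper batch-transfer hook for tuples/dicts
            probe = batch.to(device) if hasattr(batch, "to") else batch
            model.training_step(probe, 0)  # approximate the actual memory footprint
            capable.append(device_id)
        except RuntimeError:  # typically a CUDA out-of-memory error
            pass
        finally:
            model.cpu()
            torch.cuda.empty_cache()
    if len(capable) < requested:
        warnings.warn(f"Only {len(capable)} of the {requested} requested GPUs fit the workload")
    return capable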

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

Notes:

  • the test does not feel satisfying; I'm looking for ideas to improve it
  • the documentation already seems to convey the effect of this new auto_select_gpu behavior, hence no update on that side
  • it relies on the dataloader being defined to work properly, since a batch is sampled from it to represent the actual workload
  • I have doubts about whether the method runs too "late" with respect to the environment configuration for the training process
  • this is my first PR on an OSS project with tests and such, any comment is welcome!

@pep8speaks

pep8speaks commented Aug 6, 2020

Hello @sebastienwood! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-07 18:01:18 UTC

@mergify mergify bot requested a review from a team August 6, 2020 16:30
@Borda Borda added the feature Is an improvement or enhancement label Aug 6, 2020
@Borda Borda added this to the 0.9.0 milestone Aug 6, 2020
Member

@justusschock justusschock left a comment


The current implementation won't work due to the specified reasons. Another question: Is this called before DP/DDP initialization? Otherwise it won't work with those

@mergify mergify bot requested a review from a team August 7, 2020 06:33
@SpontaneousDuck
Contributor

I'm a little hesitant about it automatically selecting fewer GPUs than requested if not enough are usable, even if it warns the user. It might be useful to have a flag that lets the user choose whether this throws an error or automatically reduces the number of GPUs.

@justusschock
Member

@SpontaneousDuck actually I strongly disagree here. We already have so many flags that I don't want to add another one. But I agree that the correct behaviour should be to raise an error

@sebastienwood
Author

The current implementation won't work due to the specified reasons. Another question: Is this called before DP/DDP initialization? Otherwise it won't work with those

It is called at line 975 in trainer.py, before the whole if/elif over accelerator backends which sets up DP/DDP.
Would model_device.training_step(batch_device, 0) be ok for line 487? It should delegate batch handling to the user.

@SpontaneousDuck
Contributor

SpontaneousDuck commented Aug 7, 2020

@justusschock Sounds great to me! I was just a little worried about reproducibility, since a variable number of GPUs can affect it. Comparisons between training runs can also be skewed if the number of GPUs used varies. Just throwing an error instead would solve that. I do agree there are already a ton of flags!

@mergify
Contributor

mergify bot commented Aug 7, 2020

This pull request is now in conflict... :(

Collaborator

@SkafteNicki SkafteNicki left a comment


PR is looking good. Remember to add this behavior to the documentation (that Lightning will actively look for GPUs which have enough free memory).

Question: is it possible to add a test where we have two gpus, still request 1 with auto_select_gpus=True and then we artificially fill up gpu 0 and assert that we actually pick gpu 1?
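
A rough sketch of such a test (assumptions: at least 2 GPUs, torch.cuda.mem_get_info is available, and the picked ids end up in trainer.data_parallel_device_ids as in the diff below; filling GPU 0 reliably is the tricky part):

import pytest
import torch

from pytorch_lightning import Trainer


@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="test requires at least 2 GPUs")
def test_auto_select_skips_full_gpu():
    # occupy most of GPU 0 so the sampled workload no longer fits there
    free_bytes, _ = torch.cuda.mem_get_info(0)
    filler = torch.empty(int(free_bytes * 0.9), dtype=torch.uint8, device="cuda:0")

    trainer = Trainer(gpus=1, auto_select_gpus=True)
    assert trainer.data_parallel_device_ids == [1]

    del filler
    torch.cuda.empty_cache()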

# Called when model/data is known. Ensure the GPUs used have enough VRAM.
self.gpus = pick_multiple_gpus(len(self.gpus), model) # At most the current number of GPUs

self.data_parallel_device_ids = _parse_gpu_ids(self.gpus)
Collaborator


Could this function be used during __init__ to avoid duplicating code?

self.setup(stage_name)
model.setup(stage_name)

def update_auto_selected_gpus(self, model):
Collaborator


Suggested change
def update_auto_selected_gpus(self, model):
def update_auto_selected_gpus(self, model: LightningModule):

@mergify mergify bot requested a review from a team August 13, 2020 12:42
@justusschock
Member

@sebastienwood While this PR seems to be good, I have another question:

Let's say I have a system with two GPUs, which are both empty at the beginning and can both hold 10GB.

My model (including inputs, activations and gradients) takes only 4GB. So when I now tell Lightning to search for a free GPU, it will use the first one (which is great). But if I want to run the same experiment in parallel with another config, it would probably select the same GPU again, since there is still memory left.

However, if my model is so computationally expensive that the GPU utilisation is always high even though I don't need much memory, it will clearly slow everything down. Can't we use free GPUs if available, or maybe go for utilisation first and then memory as a second-order criterion?
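
One way to express that priority (a pynvml sketch, not something in this PR or in Lightning): rank GPUs by utilisation first and use free memory only as a tie-breaker.

import pynvml  # NVIDIA management library bindings


def rank_gpus_by_utilisation():
    # prefer idle GPUs; fall back to memory headroom as a second-order criterion
    pynvml.nvmlInit()
    try:
        scores = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            utilisation = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu  # percent busy
            memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
            scores.append((utilisation, -memory.free, index))
        return [index for _, _, index in sorted(scores)]
    finally:
        pynvml.nvmlShutdown()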

@williamFalcon
Contributor

yeah, I agree with @justusschock. We don't want to put multiple models on the same GPU. For our purposes we should consider any GPU that is being utilized as "full" (even if the memory usage is only 5%, for example)

@justusschock
Member

@williamFalcon actually, we shouldn't be that strict. We should search for empty GPUs, but if no empty GPU is there, we should use the ones that actually fit from a memory perspective, since on workstations you often also have the X server allocating some GPU memory (typically 100-200 MB)

@williamFalcon
Contributor

yeah, but the problem is that if you are sharing this with someone you’ll put your work on their gpu and cut their speed in half

@justusschock
Member

That's why I wanted to look at GPU-Utilization instead :)

@awaelchli
Contributor

awaelchli commented Aug 14, 2020

Here is a compromise/proposition: implement the strict version in such a way that it can easily be overridden by anyone who wants different behavior or has special requirements on the type of GPU etc.
Instead of a function, make it a class that implements the algorithm. Break it down into its fundamental blocks. Let the user override parts of it or the whole algorithm. Then:

trainer = Trainer(gpus="auto") # default algorithm
trainer = Trainer(gpus=SophisticatedGPUChooser())  # custom algorithm
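
A sketch of what such a class could look like (the names and the mem_get_info-based emptiness check are assumptions, not an existing Lightning API):

from typing import List

import torch


class GPUChooser:
    # hypothetical base class: override individual blocks or the whole algorithm

    def candidates(self) -> List[int]:
        return list(range(torch.cuda.device_count()))

    def is_available(self, device_id: int) -> bool:
        # strict default: require (almost) all memory to be free; the driver
        # itself reserves a little, so an exact emptiness check would never pass
        free, total = torch.cuda.mem_get_info(device_id)
        return free / total > 0.98

    def choose(self, num_gpus: int) -> List[int]:
        chosen = [i for i in self.candidates() if self.is_available(i)][:num_gpus]
        if len(chosen) < num_gpus:
            raise RuntimeError(f"Only {len(chosen)} of {num_gpus} requested GPUs are available")
        return chosen


class SophisticatedGPUChooser(GPUChooser):
    # example override: tolerate GPUs that are already partially used

    def is_available(self, device_id: int) -> bool:
        free, total = torch.cuda.mem_get_info(device_id)
        return free / total > 0.5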

@sebastienwood
Author

sebastienwood commented Aug 14, 2020

@awaelchli this is appealing. There are a limited number of scenarios (based on the discussions):

  • fewer than the requested number of GPUs could accept the workload: throw an error
  • enough capable GPUs were found, but they are already in use: go/no-go decision
  • enough capable GPUs were found and they're free

The only block left to settle is the go/no-go decision. Is it ok to have a simple API for the decision, based on the current maximal VRAM usage ratio of the capable GPUs?

E.g. we have one GPU and we request one. Its VRAM load is 25% and it is capable. criterion_function(0.25) -> bool returns the go/no-go decision.
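
A minimal sketch of that hook (criterion_function as named above; the default threshold is an assumption):

def criterion_function(vram_usage_ratio: float, threshold: float = 0.5) -> bool:
    # go/no-go decision for an otherwise capable GPU based on its current VRAM load
    return vram_usage_ratio < threshold


# the example above: a capable GPU at 25% VRAM load gets a "go" with the default threshold
assert criterion_function(0.25) is True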

@Borda Borda modified the milestones: 1.0.0, 0.9.x Aug 20, 2020
@edenlightning
Contributor

mind rebasing?

@edenlightning edenlightning modified the milestones: 0.9.x, 1.0, 1.1 Oct 4, 2020
@edenlightning
Contributor

@williamFalcon @awaelchli what do you think of the latest suggestion?

@awaelchli
Contributor

If a GPU is in use, we should not pick it as a candidate. Even if the memory allows it, you can't really judge the GPU utilisation, and I have not had good experiences with it, with processes crashing. I think it should be one process per GPU max.
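
For reference, the "one process per GPU" rule could be checked roughly like this (a pynvml sketch, not part of this PR):

import pynvml


def gpus_without_running_processes():
    # treat a GPU as a candidate only if no compute process is attached to it
    pynvml.nvmlInit()
    try:
        candidates = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            if not pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
                candidates.append(index)
        return candidates
    finally:
        pynvml.nvmlShutdown()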

@stale

stale bot commented Nov 7, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

@stale stale bot added the won't fix This will not be worked on label Nov 7, 2020
@stale

stale bot commented Nov 12, 2020

This pull request is going to be closed. Please feel free to reopen it or create a new one from the current master.

@stale stale bot closed this Nov 12, 2020

@qsh-zh

qsh-zh commented Nov 3, 2021

@awaelchli Your suggestion is appealing. I am new to PyTorch Lightning; do we have such a feature where we can select the best available device?


Labels

  • feature: Is an improvement or enhancement
  • won't fix: This will not be worked on

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • Enhancing auto_select_GPU
  • Document auto_select_gpus