
Automatically pick available GPU #951

@ahendriksen

Thanks for this great library!

🚀 Feature

I would like to change the behavior of this code:

    trainer = pl.Trainer(
        ... snip ...,
        gpus=1,
    )

Currently, when setting gpus to an integer n, the first n GPUs are automatically used.

I would like to change the behavior such that when multiple GPUs are available, the trainer picks the first available GPU.
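
For reference, a minimal sketch of the two spellings under the current behavior described above (assuming an integer value of gpus simply maps to the first n device indices):

    import pytorch_lightning as pl

    # Current behavior (assumption, per the description above): an integer
    # selects the first n device indices, so these two calls are equivalent.
    trainer_int = pl.Trainer(gpus=2)
    trainer_list = pl.Trainer(gpus=[0, 1])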

Motivation

When running multiple jobs in parallel on a server with multiple available GPUs, I get the error:

RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

This is because all 4 running jobs are scheduled to GPU 0, even though I have 4 GPUs available.

Note: the GPUs are configured to be in "exclusive mode", which means that only one process at a time can use them.

Pitch

When gpus is set to an integer n, the trainer should pick the first n GPUs that are actually available (i.e. not occupied by another process), instead of always picking GPUs 0 through n-1.
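
A minimal sketch of what that could look like. Here pick_available_gpus is a hypothetical helper name, not an existing Lightning function; availability is probed by attempting a small allocation on each device, which fails with a RuntimeError on busy exclusive-mode GPUs:

    import torch

    def pick_available_gpus(n):
        # Hypothetical helper: return the indices of the first n GPUs that
        # accept an allocation. On exclusive-mode GPUs, busy devices raise
        # a RuntimeError and are skipped.
        free = []
        for i in range(torch.cuda.device_count()):
            try:
                torch.ones(1, device=f"cuda:{i}")
            except RuntimeError:
                continue
            free.append(i)
            if len(free) == n:
                return free
        raise RuntimeError(f"Requested {n} GPUs, but only {len(free)} are available.")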

Alternatives

One could also fix this in client code, like this:

    import random
    import time

    import torch
    import pytorch_lightning as pl


    def retry_jittered_backoff(f, num_retries=5):
        # Retry f(), sleeping with jittered exponential backoff between attempts.
        # Based on:
        # https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
        cap = 1.0     # max sleep time is 1s
        base = 0.01   # initial sleep time is 10ms
        sleep = base

        for i in range(num_retries):
            try:
                return f()
            except RuntimeError:
                if i == num_retries - 1:
                    raise
                # Back off before the next attempt.
                time.sleep(sleep)
                sleep = min(cap, random.uniform(base, sleep * 3))


    def pick_gpu():
        # Return the index of the first GPU that accepts an allocation.
        # On GPUs in exclusive mode, busy devices raise a RuntimeError.
        for i in range(torch.cuda.device_count()):
            torch.cuda.set_device(i)
            try:
                torch.ones(1).cuda()
            except RuntimeError:
                continue
            return i
        raise RuntimeError("No GPUs available.")


    def main(.. snip ..):
        model = Model(.. snip ..)
        trainer = pl.Trainer(
            .. snip ..,
            gpus=[retry_jittered_backoff(pick_gpu)],
        )
        trainer.fit(model)

This is exactly the kind of boilerplate that I would hope pytorch-lightning can make redundant.

PR-able?

Would you be open to accepting a PR for this? If so, could you give me some pointers on where to add the above code?
