
Automatically pick available GPU #951

@ahendriksen

Thanks for this great library!

🚀 Feature

I would like to change the behavior of this code:

    trainer = pl.Trainer(
        ... snip ...,
        gpus=1,
    )

Currently, when setting gpus to an integer n, the first n GPUs are automatically used.

I would like to change the behavior such that when multiple GPUs are available, the trainer picks the first available GPU.
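
For reference, a minimal sketch of the two spellings under the current behavior described above (assuming an integer value of gpus simply maps to the first n device indices):

    import pytorch_lightning as pl

    # Current behavior (assumption, per the description above): an integer
    # selects the first n device indices, so these two calls are equivalent.
    trainer_int = pl.Trainer(gpus=2)
    trainer_list = pl.Trainer(gpus=[0, 1])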

Motivation

When running multiple jobs in parallel on a server with multiple available GPUs, I get the error:

RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

This is because all 4 running jobs are scheduled to GPU 0, even though I have 4 GPUs available.

Note: the GPUs are configured to be in "exclusive mode", which means that only one process at a time can use them.

Pitch

When gpus is set to an integer n, the trainer should pick the first n GPUs that are actually available (i.e. not occupied by another process), instead of always picking GPUs 0 through n-1.
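
A minimal sketch of what that could look like. Here pick_available_gpus is a hypothetical helper name, not an existing Lightning function; availability is probed by attempting a small allocation on each device, which fails with a RuntimeError on busy exclusive-mode GPUs:

    import torch

    def pick_available_gpus(n):
        # Hypothetical helper: return the indices of the first n GPUs that
        # accept an allocation. On exclusive-mode GPUs, busy devices raise
        # a RuntimeError and are skipped.
        free = []
        for i in range(torch.cuda.device_count()):
            try:
                torch.ones(1, device=f"cuda:{i}")
            except RuntimeError:
                continue
            free.append(i)
            if len(free) == n:
                return free
        raise RuntimeError(f"Requested {n} GPUs, but only {len(free)} are available.")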

Alternatives

One could also fix this in client code, like this:

    import random
    import time

    import torch
    import pytorch_lightning as pl


    def retry_jittered_backoff(f, num_retries=5):
        # Retry f(), sleeping with jittered exponential backoff between attempts.
        # Based on:
        # https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
        cap = 1.0     # max sleep time is 1s
        base = 0.01   # initial sleep time is 10ms
        sleep = base

        for i in range(num_retries):
            try:
                return f()
            except RuntimeError:
                if i == num_retries - 1:
                    raise
                # Back off before the next attempt.
                time.sleep(sleep)
                sleep = min(cap, random.uniform(base, sleep * 3))


    def pick_gpu():
        # Return the index of the first GPU that accepts an allocation.
        # On GPUs in exclusive mode, busy devices raise a RuntimeError.
        for i in range(torch.cuda.device_count()):
            torch.cuda.set_device(i)
            try:
                torch.ones(1).cuda()
            except RuntimeError:
                continue
            return i
        raise RuntimeError("No GPUs available.")


    def main(.. snip ..):
        model = Model(.. snip ..)
        trainer = pl.Trainer(
            .. snip ..,
            gpus=[retry_jittered_backoff(pick_gpu)],
        )
        trainer.fit(model)

This is exactly the kind of boilerplate that I would hope pytorch-lightning can make redundant.

PR-able?

Would you be open to accepting a PR for this? If so, could you give me some pointers on where to add the above code?
