Description
Thanks for this great library!
🚀 Feature
I would like to change the behavior of this code:
```python
trainer = pl.Trainer(
    ... snip ...,
    gpus=1,
)
```

Currently, when setting `gpus` to an integer n, the first n GPUs are automatically used.
I would like to change the behavior such that when multiple GPUs are available, the trainer picks the first available GPU.
Motivation
When running multiple jobs in parallel on a server with multiple available GPUs, I get the error:
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
This is because all 4 running jobs are scheduled to GPU 0, even though I have 4 GPUs available.
Note: the GPUs are configured to be in "exclusive mode", which means that only one process at a time can use them.
Pitch
I would like to change the behavior such that when multiple GPUs are available, the trainer picks the first available GPU.
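Concretely, the proposed behavior could be sketched as a small device-resolution step inside the trainer. The names below are hypothetical (not existing Lightning internals), and the `is_free` probe stands in for a small CUDA allocation that fails on a busy exclusive-mode GPU:

```python
def parse_gpus(requested, device_count, is_free):
    # Hypothetical helper: instead of returning list(range(requested)),
    # probe each device and return the first `requested` free ones.
    # `is_free(i)` would wrap something like a torch.ones(1).cuda() probe
    # on device i, which raises RuntimeError on a busy exclusive-mode GPU.
    free = [i for i in range(device_count) if is_free(i)]
    if len(free) < requested:
        raise RuntimeError("Not enough free GPUs.")
    return free[:requested]

# Hypothetical scenario: 4 GPUs, devices 0 and 2 busy; gpus=2 resolves to [1, 3].
busy = {0, 2}
print(parse_gpus(2, 4, lambda i: i not in busy))  # → [1, 3]
```

With this, four parallel jobs on a four-GPU machine would each land on a different free device instead of all piling onto GPU 0.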
Alternatives
One could also fix this in client code, like this:
```python
import random
import time

def retry_jittered_backoff(f, num_retries=5):
    # Based on:
    # https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
    cap = 1.0    # max sleep time is 1s
    base = 0.01  # initial sleep time is 10ms
    sleep = base
    for i in range(num_retries):
        try:
            return f()
        except RuntimeError:
            if i == num_retries - 1:
                raise
            time.sleep(sleep)
            sleep = min(cap, random.uniform(base, sleep * 3))
```
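For illustration, the backoff helper can be exercised with a stand-in for the CUDA call (the `flaky` callable below is hypothetical; the helper is repeated so the snippet runs standalone):

```python
import random
import time

def retry_jittered_backoff(f, num_retries=5):
    cap = 1.0    # max sleep time is 1s
    base = 0.01  # initial sleep time is 10ms
    sleep = base
    for i in range(num_retries):
        try:
            return f()
        except RuntimeError:
            if i == num_retries - 1:
                raise
            time.sleep(sleep)
            sleep = min(cap, random.uniform(base, sleep * 3))

calls = {"n": 0}
def flaky():
    # Hypothetical stand-in for a CUDA call on a busy exclusive-mode GPU:
    # fails on the first two attempts, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("all CUDA-capable devices are busy or unavailable")
    return calls["n"]

print(retry_jittered_backoff(flaky))  # → 3
```

The jitter matters here: when several jobs race for the same set of exclusive-mode GPUs, randomized sleeps keep them from retrying in lockstep.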
```python
import torch

def pick_gpu():
    # Probe each device with a tiny allocation; on an exclusive-mode GPU
    # that is already in use, the allocation raises a RuntimeError.
    for i in range(torch.cuda.device_count()):
        torch.cuda.set_device(i)
        try:
            torch.ones(1).cuda()
        except RuntimeError:
            continue
        return i
    raise RuntimeError("No GPUs available.")
```
```python
def main(.. snip ..):
    model = Model(.. snip ..)
    trainer = pl.Trainer(
        .. snip ..,
        gpus=[retry_jittered_backoff(pick_gpu)],
    )
    trainer.fit(model)
```

This is exactly the kind of boilerplate that I would hope pytorch-lightning can make redundant.
PR-able?
Would you be open to accepting a PR for this? If so, could you give me some pointers on where to add the above code?