Describe the bug
I'm using a T4 on the Colab free tier. When I start training, it fails with a CUDA out-of-memory error; this only happens when I enable prior_preservation.
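For context, with prior_preservation enabled each training step processes both instance images and generated class images, so the per-step memory footprint roughly doubles. Below is a minimal sketch of the kind of memory-saving settings that typically make DreamBooth fit on a 16 GB T4; the argument names (gradient_checkpointing, use_8bit_adam, mixed_precision, ...) are assumptions based on the diffusers DreamBooth training script, not necessarily this exact notebook.

```python
# Sketch only: assumed argument names for the args object passed to the
# training function in a diffusers DreamBooth-style notebook.
from argparse import Namespace

args = Namespace(
    with_prior_preservation=True,
    prior_loss_weight=1.0,
    train_batch_size=1,            # keep the per-step batch as small as possible
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,   # trade extra compute for much lower activation memory
    use_8bit_adam=True,            # bitsandbytes 8-bit Adam shrinks optimizer state
    mixed_precision="fp16",        # fp16 is usually required on a 16 GB T4
    max_grad_norm=1.0,
)
```

Gradient checkpointing plus 8-bit Adam are usually the two largest savings and are often enough to train with prior preservation on a 16 GB card.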
Run training
Launching training on one GPU.
Steps:   0% | 1/450 [00:10<1:20:12, 10.72s/it, loss=0.0338]
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-2-c6e3ce5f5a40> in <module>
1 #@title Run training
2 import accelerate
----> 3 accelerate.notebook_launcher(training_function, args=(text_encoder, vae, unet))
4 with torch.no_grad():
5 torch.cuda.empty_cache()
7 frames
/usr/local/lib/python3.7/dist-packages/accelerate/launchers.py in notebook_launcher(function, args, num_processes, use_fp16, mixed_precision, use_port)
81 else:
82 print("Launching training on one CPU.")
---> 83 function(*args)
84
85 else:
<ipython-input-1-d9553ec566fc> in training_function(text_encoder, vae, unet)
364 loss = F.mse_loss(noise_pred, noise, reduction="none").mean([1, 2, 3]).mean()
365
--> 366 accelerator.backward(loss)
367 accelerator.clip_grad_norm_(unet.parameters(), args.max_grad_norm)
368 optimizer.step()
/usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py in backward(self, loss, **kwargs)
882 self.scaler.scale(loss).backward(**kwargs)
883 else:
--> 884 loss.backward(**kwargs)
885
886 def unscale_gradients(self, optimizer=None):
/usr/local/lib/python3.7/dist-packages/torch/_tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
394 create_graph=create_graph,
395 inputs=inputs)
--> 396 torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
397
398 def register_hook(self, hook):
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
173 Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
174 tensors, grad_tensors_, retain_graph, create_graph, inputs,
--> 175 allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
176
177 def grad(
/usr/local/lib/python3.7/dist-packages/torch/autograd/function.py in apply(self, *args)
251 "of them.")
252 user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn
--> 253 return user_fn(self, *args)
254
255 def apply_jvp(self, *args):
/usr/local/lib/python3.7/dist-packages/torch/utils/checkpoint.py in backward(ctx, *args)
144 "none of output has requires_grad=True,"
145 " this checkpoint() is not necessary")
--> 146 torch.autograd.backward(outputs_with_grad, args_with_grad)
147 grads = tuple(inp.grad if isinstance(inp, torch.Tensor) else None
148 for inp in detached_inputs)
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
173 Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
174 tensors, grad_tensors_, retain_graph, create_graph, inputs,
--> 175 allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
176
177 def grad(
RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 14.76 GiB total capacity; 12.24 GiB already allocated; 877.75 MiB free; 12.79 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
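The message above suggests tuning max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF. That mainly helps when the failure comes from fragmentation (reserved memory much larger than allocated); here about 12.2 GiB is genuinely allocated, so it may not be enough on its own, but it is cheap to try. A minimal sketch follows, assuming the variable is set before anything touches the GPU; the value 128 is an arbitrary example, not a tested recommendation.

```python
# Set the allocator config before the first CUDA allocation,
# i.e. in the very first notebook cell, before any model is moved to the GPU.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # example value

import torch
torch.cuda.empty_cache()  # release cached, unreferenced blocks back to the driver
```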
Reproduction
No response
Logs
No response
System Info
T4 GPU on the Colab free tier