I am trying to squeeze training onto my 6GB laptop RTX 2060, and can't quite manage it with the "low memory" config:
accelerate launch --num_cpu_threads_per_process 8 train_db.py \
--pretrained_model_name_or_path="/home/alpha/Storage/AIModels/Stable-diffusion/panatomy05full_0.7-AIModels_Anything-V3.0-pruned-fp16_0.3-Weighted_sum-merged.ckpt" \
--train_data_dir="/home/alpha/Storage/TrainingData/test/training_data" \
--output_dir="/home/alpha/Storage/TrainingOutput/test/" \
--prior_loss_weight=1.0 \
--resolution=512 \
--train_batch_size=1 \
--learning_rate=1e-6 \
--max_train_steps=1600 \
--use_8bit_adam \
--xformers \
--mixed_precision="fp16" \
--cache_latents \
--gradient_checkpointing \
--save_precision="fp16" \
--full_fp16 \
--save_model_as="safetensors"
So I figured I would investigate DeepSpeed CPU offloading via the accelerate config, but I keep running into errors on both the git version and the 0.7.7 release from PyPI.
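For context, an accelerate config for DeepSpeed ZeRO stage 2 with optimizer CPU offload looks roughly like this (a sketch reconstructed from the `accelerate config` questionnaire; exact keys and values may differ by accelerate version):
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
use_cpu: false
Here is an error from the PyPI release: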
Traceback (most recent call last):
File "/home/alpha/clone/sd-scripts/train_db.py", line 332, in <module>
train(args)
File "/home/alpha/clone/sd-scripts/train_db.py", line 154, in train
unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
File "/home/alpha/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 619, in prepare
result = self._prepare_deepspeed(*args)
File "/home/alpha/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 805, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/home/alpha/.local/lib/python3.10/site-packages/deepspeed/__init__.py", line 125, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/alpha/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 330, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/alpha/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1210, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/home/alpha/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1455, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/home/alpha/.local/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 532, in __init__
self._param_slice_mappings = self._create_param_mapping()
File "/home/alpha/.local/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 544, in _create_param_mapping
lp_name = self.param_names[lp]
KeyError: <exception str() failed>
[2023-01-12 13:13:52,241] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 5398
[2023-01-12 13:13:52,244] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python', '-u', 'train_db.py', '--pretrained_model_name_or_path=/home/alpha/Storage/AIModels/Stable-diffusion/panatomy05full_0.7-AIModels_Anything-V3.0-pruned-fp16_0.3-Weighted_sum-merged.ckpt', '--train_data_dir=/home/alpha/Storage/TrainingData/test/training_data', '--output_dir=/home/alpha/Storage/TrainingOutput/test/', '--prior_loss_weight=1.0', '--resolution=512', '--train_batch_size=1', '--learning_rate=1e-6', '--max_train_steps=1600', '--use_8bit_adam', '--xformers', '--mixed_precision=fp16', '--cache_latents', '--gradient_checkpointing', '--save_precision=fp16', '--full_fp16', '--save_model_as=safetensors'] exits with return code = 1
Traceback (most recent call last):
File "/home/alpha/.local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/alpha/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
args.func(args)
File "/home/alpha/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 827, in launch_command
deepspeed_launcher(args)
File "/home/alpha/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 540, in deepspeed_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['deepspeed', '--no_local_rank', '--num_gpus', '1', 'train_db.py', '--pretrained_model_name_or_path=/home/alpha/Storage/AIModels/Stable-diffusion/panatomy05full_0.7-AIModels_Anything-V3.0-pruned-fp16_0.3-Weighted_sum-merged.ckpt', '--train_data_dir=/home/alpha/Storage/TrainingData/test/training_data', '--output_dir=/home/alpha/Storage/TrainingOutput/test/', '--prior_loss_weight=1.0', '--resolution=512', '--train_batch_size=1', '--learning_rate=1e-6', '--max_train_steps=1600', '--use_8bit_adam', '--xformers', '--mixed_precision=fp16', '--cache_latents', '--gradient_checkpointing', '--save_precision=fp16', '--full_fp16', '--save_model_as=safetensors']' returned non-zero exit status 1.
Is there anything in particular that needs to change in this repo to support DeepSpeed? Or maybe there is some other tweak to squeeze LoRA training onto 6GB?
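In case it's relevant: as far as I can tell, accelerate's DeepSpeed integration only supports passing a single model to accelerator.prepare(), while train_db.py (line 154 in the traceback above) passes both unet and text_encoder. A purely hypothetical, untested workaround sketch (variable names taken from train_db.py's train()) would be to bundle the two models into one nn.Module before prepare():
import torch

class UnetAndTextEncoder(torch.nn.Module):
    # Hypothetical wrapper, not part of sd-scripts: give DeepSpeed a
    # single top-level model containing both trainable submodules.
    def __init__(self, unet, text_encoder):
        super().__init__()
        self.unet = unet
        self.text_encoder = text_encoder

wrapped = UnetAndTextEncoder(unet, text_encoder)
wrapped, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    wrapped, optimizer, train_dataloader, lr_scheduler
)
# Unwrap so the rest of the training loop can keep using the two names.
unet, text_encoder = wrapped.unet, wrapped.text_encoder
No idea if that is the right fix, though.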