Is your feature request related to a problem? Please describe.
Stable Diffusion is not compute-heavy in all of its steps. If we keep the diffusion UNet in fp16 on the GPU and everything else on the CPU, we could reduce GPU memory usage to 2.2GB with a not-so-big impact on performance. That should democratize Stable Diffusion even further.
The only other thing needed is to move the tensors between devices accordingly, and we can use each model's device and dtype attributes to make everything work.
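To make the tensor-moving part concrete, here is a rough sketch of the idea in plain PyTorch. It is not the code from the linked proposal: the three `nn.Linear` layers are tiny hypothetical stand-ins for the text encoder, UNet, and VAE decoder, and `device_and_dtype` / `run_on` are helper names made up for illustration. A plain `nn.Module` has no `device` or `dtype` attribute, so the sketch reads both from the module's first parameter. It falls back to CPU fp32 when no GPU is available.

```python
import torch
from torch import nn


def device_and_dtype(module: nn.Module):
    """Return the device and dtype of the module's first parameter."""
    param = next(module.parameters())
    return param.device, param.dtype


def run_on(module: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Move x to the module's device, cast it to the module's dtype, then call the module."""
    device, dtype = device_and_dtype(module)
    return module(x.to(device=device, dtype=dtype))


# Hypothetical stand-ins for the real models (not the actual Stable Diffusion classes).
text_encoder = nn.Linear(8, 16)  # light step, stays on CPU in fp32
unet = nn.Linear(16, 16)         # compute-heavy step
vae_decoder = nn.Linear(16, 4)   # light step, stays on CPU in fp32

# Only the "unet" goes to the GPU in fp16; without a GPU everything stays on CPU in fp32.
if torch.cuda.is_available():
    unet = unet.to(device="cuda", dtype=torch.float16)

with torch.no_grad():
    prompt = torch.randn(1, 8)
    latents = run_on(text_encoder, prompt)   # runs on CPU
    for _ in range(3):                       # a few "denoising" steps on the unet's device
        latents = run_on(unet, latents)
    image = run_on(vae_decoder, latents)     # moved back to CPU and cast to fp32
```

The point of routing every call through something like `run_on` is that the calling code never hard-codes `"cuda"` or `float16`: each input follows whatever device and dtype the receiving model currently has.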
Describe the solution you'd like
I think what I'm proposing in #537 should be enough.
Describe alternatives you've considered
The alternative is to use GPUs for the whole process and pay more for it.