Optimizer CPU offload doesn't work without CUDA

The `CPUOptimizerOffload` class is very clever, but it relies too heavily on CUDA streams, which aren't available without a CUDA device.

When CUDA is unavailable, it should use `torch.cpu.Stream` and `torch.cpu.current_stream` instead.

Additionally, pinned memory should only be requested when CUDA is available, i.e. `pin_memory=True if torch.cuda.is_available() else False`, since MPS is a unified memory architecture.
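For illustration, here is roughly how the host-side buffer allocation could be guarded. The names `use_pinned` and `host_buf` and the buffer size are made up for this sketch and do not come from the existing code.

```python
import torch

# Only request page-locked (pinned) host memory when a CUDA device exists.
use_pinned = torch.cuda.is_available()

# Hypothetical CPU-side buffer for offloaded optimizer state.
host_buf = torch.empty(
    1024,
    dtype=torch.float32,
    device="cpu",
    pin_memory=use_pinned,
)
```

`torch.cuda.is_available()` already returns a bool, so it can be passed to `pin_memory` directly.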