The CPUOptimizerOffload class is very clever, but it relies too heavily on CUDA streams, which aren't available without a CUDA device.
When no CUDA device is present, it should use torch.cpu.Stream and torch.cpu.current_stream instead.
Additionally, set pin_memory=torch.cuda.is_available() (i.e. True only when CUDA is present), since MPS is a unified memory architecture.