Description
I've been doing some experiments on CIFAR10 with ResNets and decided to give APEX AMP a try.
However, I ran into some performance issues:
- AMP with PyTorch's torch.nn.parallel.DistributedDataParallel was extremely slow.
- AMP with apex.parallel.DistributedDataParallel was slower than the default training with torch.nn.parallel.DistributedDataParallel (no apex involved); a minimal sketch of this setup follows below.

For reference, normal training took about 15 minutes, while apex AMP training took 21 minutes (90 epochs on CIFAR-10 with ResNet20).
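For context, this is roughly the setup I'm comparing. A minimal sketch, assuming one process per GPU launched via torch.distributed.launch; the linear layer is a toy stand-in for the ResNet20 built in train.py, and the opt_level is the one I used:

```python
import argparse

import torch
import torch.distributed as dist
from apex import amp
from apex.parallel import DistributedDataParallel

# torch.distributed.launch passes --local_rank to each worker process
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

dist.init_process_group(backend="nccl")
torch.cuda.set_device(args.local_rank)

model = torch.nn.Linear(32, 10).cuda()  # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# amp.initialize must wrap the model/optimizer *before* the DDP wrapper
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
model = DistributedDataParallel(model)

# In the training loop, the backward pass goes through amp.scale_loss
inputs = torch.randn(128, 32).cuda()
loss = model(inputs).mean()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```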
I followed the installation instructions, but I couldn't build the C++ extensions because of a GCC/CUDA version mismatch on my machine. Could that explain the slowdown?
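In case it helps, here is a rough runtime check I used (assuming a standard apex build: amp_C and apex_C are the modules compiled when apex is installed with the --cpp_ext/--cuda_ext options; without them apex falls back to pure-Python code paths):

```python
# Check whether the compiled apex extensions are importable.
try:
    import amp_C   # fused multi-tensor kernels used by the AMP optimizer step
    import apex_C  # C++ helpers used by apex.parallel
    print("apex C++/CUDA extensions found")
except ImportError as exc:
    print(f"extensions missing, apex will use slower Python fallbacks: {exc}")
```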
You can see the code here:
https://github.com/braincreators/octconv/blob/34440209c4b37fb5198f75e4e8c052e92e80e85d/benchmarks/train.py#L1-L498
To run it on 2 GPUs:
Without APEX AMP:
python -m torch.distributed.launch --nproc_per_node 2 train.py -c configs/cifar10/resnet20_small.yml --batch-size 128 --lr 0.1
With APEX AMP:
python -m torch.distributed.launch --nproc_per_node 2 train.py -c configs/cifar10/resnet20_small.yml --batch-size 128 --lr 0.1 --mixed-precision