
Mixed precision training slower than FP32 training #297

@miguelvr

I've been doing some experiments on CIFAR10 with ResNets and decided to give APEX AMP a try.

However, I ran into some performance issues:

  1. AMP with PyTorch's torch.nn.parallel.DistributedDataParallel was extremely slow.
  2. AMP with apex.parallel.DistributedDataParallel was slower than the default FP32 training with torch.nn.parallel.DistributedDataParallel (no apex involved). For reference, normal training took about 15 minutes, while apex AMP training took 21 minutes (90 epochs on CIFAR-10 with ResNet20).

I followed the installation instructions, but I couldn't build the C++ extensions because of my GCC/CUDA version. Could that explain the slowdown?
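
For context, a quick way to check whether the optional compiled extensions actually got built is to try importing apex's compiled modules directly. This is just a sketch, not an official apex diagnostic; `amp_C` and `apex_C` are the module names produced by an install with `--cpp_ext --cuda_ext`:

```python
# Sketch: probe whether apex's optional C++/CUDA extensions are importable.
# `amp_C` (fused multi-tensor kernels) and `apex_C` (flatten/unflatten helpers
# used by apex.parallel.DistributedDataParallel) only exist after a build with
# --cpp_ext --cuda_ext; a Python-only install falls back to slower code paths.
try:
    import amp_C   # noqa: F401
    import apex_C  # noqa: F401
    print("apex C++/CUDA extensions are available")
except ImportError as exc:
    print("apex is running in Python-only mode:", exc)
```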

You can see the code here:
https://github.com/braincreators/octconv/blob/34440209c4b37fb5198f75e4e8c052e92e80e85d/benchmarks/train.py#L1-L498

And run it (2 GPUs):

Without APEX AMP:
python -m torch.distributed.launch --nproc_per_node 2 train.py -c configs/cifar10/resnet20_small.yml --batch-size 128 --lr 0.1

With APEX AMP:
python -m torch.distributed.launch --nproc_per_node 2 train.py -c configs/cifar10/resnet20_small.yml --batch-size 128 --lr 0.1 --mixed-precision
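
For reference, the mixed-precision path in the script follows the usual apex AMP + DDP pattern. Below is a minimal, self-contained sketch of that pattern (the opt_level, optimizer settings, and the `build_model()`, `loader`, and `args.local_rank` names are illustrative placeholders, not copied from the linked train.py):

```python
import torch
import torch.distributed as dist
from apex import amp
from apex.parallel import DistributedDataParallel as ApexDDP

# Standard apex AMP + DDP setup (values are illustrative).
dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(args.local_rank)  # local_rank is set by torch.distributed.launch

model = build_model().cuda()            # build_model() is a placeholder
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# O1 casts to FP16 where it is considered safe; O0 is pure FP32.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
model = ApexDDP(model)

for images, targets in loader:          # loader is a placeholder DataLoader
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(images.cuda()), targets.cuda())
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()          # backward on the scaled loss for AMP
    optimizer.step()
```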
