Description
Caffe2
- OS: Ubuntu 16.04
- Python version: 2.7
- CUDA/cuDNN version: 9.1 / 7.0.5
- GPU models and configuration: P100
- GCC version (if compiling from source): 5.4.0
- Mellanox OFED version: 4.2.1
- Build command you used (if compiling from source):
Eigen with CUDA 9 gave: fatal error: math_functions.hpp: No such file or directory. This was fixed in Eigen at https://bitbucket.org/eigen/eigen/commits/034b6c3e101792a3cc3ccabd9bfaddcabe85bb58?at=default
$ mkdir build
$ cd build
$ cmake .. -DCUDA_ARCH_NAME=Manual -DCUDA_ARCH_BIN="60 61" -DCUDA_ARCH_PTX="61" -DUSE_NNPACK=OFF -DUSE_ROCKSDB=OFF -DUSE_GLOO=ON -DUSE_REDIS=ON -DUSE_IBVERBS=ON -DUSE_MPI=OFF
$ make -j"$(nproc)" install
$ ldconfig
$ make clean
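Side note, not part of the build steps above: a quick sanity check I would run after installing, to confirm that GPU support plus the Redis store handler and the Gloo collective ops this distributed run depends on were actually compiled in. It assumes core.IsOperator / core.IsOperatorWithEngine are available in this Caffe2 build.

# Post-build sanity check (my own addition, not from the report):
# verify that the ops needed for the distributed run are registered.
from caffe2.python import core, workspace

print("GPU support:", workspace.has_gpu_support)
print("RedisStoreHandlerCreate:", core.IsOperator("RedisStoreHandlerCreate"))
for op in ("CreateCommonWorld", "Allreduce", "Broadcast"):
    print(op, "(GLOO engine):", core.IsOperatorWithEngine(op, "GLOO"))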
First I ran resnet50_trainer.py on a single node with:
python /root/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /data/ilsvrc12_train_lmdb/ --test_data /data/ilsvrc12_val_lmdb --batch_size 64 --epoch_size 10000 --num_epochs 2 --image_size 256 --num_gpus 4
and got ~430 images/sec.
Then, when I try to run resnet50_trainer.py across 2 machines using Gloo/Redis, I get the following error messages.
How can I fix the error and run this correctly using mpirun?
My commands:
On each node I ran one of the following commands (shard 0 and shard 1 respectively); a sketch of how these flags set up the rendezvous follows the commands:
python /root/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /data/ilsvrc12_train_lmdb/ --test_data /data/ilsvrc12_val_lmdb --batch_size 32 --run_id 1 --epoch_size 10000 --num_epochs 2 --image_size 256 --num_gpus 1 --redis_host 10.143.119.44 --redis_port 5555 --num_shards 2 --shard_id 0 --dtype float16 --float16_compute --distributed_transport ibverbs --distributed_interfaces mlx5_3
python /root/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /data/ilsvrc12_train_lmdb/ --test_data /data/ilsvrc12_val_lmdb --batch_size 32 --run_id 1 --epoch_size 10000 --num_epochs 2 --image_size 256 --num_gpus 1 --redis_host 10.143.119.44 --redis_port 5555 --num_shards 2 --shard_id 1 --dtype float16 --float16_compute --distributed_transport ibverbs --distributed_interfaces mlx5_3
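For context, here is a rough sketch of how I understand the --redis_host/--redis_port/--run_id, --num_shards/--shard_id, and --distributed_transport/--distributed_interfaces flags become a Gloo rendezvous over the Redis store inside resnet50_trainer.py. The operator and key names are from my reading of the script and may not match this exact revision; the literal values are the ones from the two commands above.

# Sketch only: how the flags above map onto the Redis-backed Gloo rendezvous.
# Operator/argument names are my reading of resnet50_trainer.py, not verified
# against this exact revision.
from caffe2.python import core, workspace

store_handler = "store_handler"
workspace.RunOperatorOnce(
    core.CreateOperator(
        "RedisStoreHandlerCreate", [], [store_handler],
        host="10.143.119.44",   # --redis_host
        port=5555,              # --redis_port
        prefix="1",             # --run_id, must be identical on both shards
    )
)

rendezvous = dict(
    kv_handler=store_handler,
    shard_id=0,                 # --shard_id: 0 on the first node, 1 on the second
    num_shards=2,               # --num_shards
    engine="GLOO",
    transport="ibverbs",        # --distributed_transport
    interface="mlx5_3",         # --distributed_interfaces
    exit_nets=None,
)
# data_parallel_model.Parallelize_GPU(...) is then called with
# rendezvous=rendezvous, which inserts CreateCommonWorld/Allreduce ops that
# communicate over ibverbs between the two shards.

Both shards have to agree on run_id and num_shards and differ only in shard_id.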
Errors on the first node (shard 0):
INFO:resnet50_trainer:Finished iteration 155/156 of epoch 0 (207.29 images/sec)
INFO:resnet50_trainer:Training loss: 0.000181300914846, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 156/156 of epoch 0 (204.85 images/sec)
INFO:resnet50_trainer:Training loss: 0.00019393475668, accuracy: 1.0
E0404 16:13:32.042721 39912 prefetch_op.h:110] Prefetching error std::bad_alloc
E0404 16:13:32.042976 39911 prefetch_op.h:83] Prefetching failed.
E0404 16:13:32.043117 39911 net_dag.cc:231] Operator chain failed starting at: input: "test_reader" output: "gpu_0/data" output: "gpu_0/label" name: "" type: "ImageInput" arg { name: "std" f: 128 } arg { name: "scale" i: 256 } arg { name: "use_gpu_transform" i: 1 } arg { name: "cudnn_exhaustive_search" i: 1 } arg { name: "crop" i: 256 } arg { name: "is_test" i: 1 } arg { name: "use_cudnn" i: 1 } arg { name: "use_caffe_datum" i: 1 } arg { name: "mirror" i: 1 } arg { name: "output_type" s: "float16" } arg { name: "batch_size" i: 32 } arg { name: "mean" f: 128 } device_option { device_type: 1 cuda_gpu_id: 0 }
E0404 16:13:32.043376 39646 net.h:54] Failed to execute async run
Original python traceback for operator 1886221678 in network resnet50_test in exception above (most recent call last):
Traceback (most recent call last):
File "/root/caffe2/caffe2/python/examples/resnet50_trainer.py", line 615, in
main()
File "/root/caffe2/caffe2/python/examples/resnet50_trainer.py", line 611, in main
Train(args)
File "/root/caffe2/caffe2/python/examples/resnet50_trainer.py", line 529, in Train
explog
File "/root/caffe2/caffe2/python/examples/resnet50_trainer.py", line 201, in RunEpoch
workspace.RunNet(test_model.net.Proto().name)
File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 215, in RunNet
StringifyNetName(name), num_iter, allow_fail,
File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 177, in CallWithExceptionIntercept
return func(*args, **kwargs)
RuntimeError: [enforce fail at pybind_state.cc:1001] success. Error running net resnet50_test
On the second node:
INFO:resnet50_trainer:Training loss: 6.98969364166, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 156/156 of epoch 0 (205.14 images/sec)
INFO:resnet50_trainer:Training loss: 6.99336242676, accuracy: 0.0
INFO:resnet50_trainer:Starting epoch 1/2
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at /root/caffe2/third_party/gloo/gloo/transport/ibverbs/pair.cc:417] wc->status == IBV_WC_SUCCESS. 12 vs 0. Send for slot 2: transport retry counter exceeded
*** Aborted at 1522847623 (unix time) try "date -d @1522847623" if you are using GNU date ***
PC: @ 0x7f84fc58e428 gsignal
*** SIGABRT (@0x9a14) received by PID 39444 (TID 0x7f83ebf8d700) from PID 39444; stack trace: ***
@ 0x7f84fc58e4b0 (unknown)
@ 0x7f84fc58e428 gsignal
@ 0x7f84fc59002a abort
@ 0x7f848079584d __gnu_cxx::__verbose_terminate_handler()
@ 0x7f84807936b6 (unknown)
@ 0x7f8480793701 std::terminate()
@ 0x7f84807bed38 (unknown)
@ 0x7f84fc92a6ba start_thread
@ 0x7f84fc66041d clone
@ 0x0 (unknown)
Aborted (core dumped)