Description
Caffe2
- OS: Ubuntu 16.04
- Python version: 2.7
- CUDA/cuDNN version: 9.1 / 7.0.5
- GPU models and configuration: P100
- GCC version (if compiling from source): 5.4.0
- Mellanox OFED version: 4.2.1
- Build command you used (if compiling from source):
Eigen with CUDA 9 gave: fatal error: math_functions.hpp: No such file or directory. This was fixed in Eigen at https://bitbucket.org/eigen/eigen/commits/034b6c3e101792a3cc3ccabd9bfaddcabe85bb58?at=default
$ mkdir build
$ cd build
$ cmake .. -DCUDA_ARCH_NAME=Manual -DCUDA_ARCH_BIN="60 61" -DCUDA_ARCH_PTX="61" -DUSE_NNPACK=OFF -DUSE_ROCKSDB=OFF -DUSE_GLOO=ON -DUSE_REDIS=ON -DUSE_IBVERBS=ON -DUSE_MPI=OFF
$ make -j"$(nproc)" install
$ ldconfig
$ make clean
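Side note, not part of the build steps above: a quick sanity check I would run after installing, to confirm that GPU support plus the Redis store handler and the Gloo collective ops this distributed run depends on were actually compiled in. It assumes core.IsOperator / core.IsOperatorWithEngine are available in this Caffe2 build.

# Post-build sanity check (my own addition, not from the report):
# verify that the ops needed for the distributed run are registered.
from caffe2.python import core, workspace

print("GPU support:", workspace.has_gpu_support)
print("RedisStoreHandlerCreate:", core.IsOperator("RedisStoreHandlerCreate"))
for op in ("CreateCommonWorld", "Allreduce", "Broadcast"):
    print(op, "(GLOO engine):", core.IsOperatorWithEngine(op, "GLOO"))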
First I ran resnet50_trainer.py on a single node with:
python /root/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /data/ilsvrc12_train_lmdb/ --test_data /data/ilsvrc12_val_lmdb --batch_size 64 --epoch_size 10000 --num_epochs 2 --image_size 256 --num_gpus 4
and got ~430 images/sec.
Then, when I try to run resnet50_trainer.py across 2 machines using Gloo/Redis, I get the following error messages.
How can I fix the error and run this correctly using mpirun?
My commands:
On each node I ran one of the following commands (shard 0 and shard 1 respectively); a sketch of how these flags set up the rendezvous follows the commands:
python /root/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /data/ilsvrc12_train_lmdb/ --test_data /data/ilsvrc12_val_lmdb --batch_size 32 --run_id 1 --epoch_size 10000 --num_epochs 2 --image_size 256 --num_gpus 1 --redis_host 10.143.119.44 --redis_port 5555 --num_shards 2 --shard_id 0 --dtype float16 --float16_compute --distributed_transport ibverbs --distributed_interfaces mlx5_3
python /root/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /data/ilsvrc12_train_lmdb/ --test_data /data/ilsvrc12_val_lmdb --batch_size 32 --run_id 1 --epoch_size 10000 --num_epochs 2 --image_size 256 --num_gpus 1 --redis_host 10.143.119.44 --redis_port 5555 --num_shards 2 --shard_id 1 --dtype float16 --float16_compute --distributed_transport ibverbs --distributed_interfaces mlx5_3
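For context, here is a rough sketch of how I understand the --redis_host/--redis_port/--run_id, --num_shards/--shard_id, and --distributed_transport/--distributed_interfaces flags become a Gloo rendezvous over the Redis store inside resnet50_trainer.py. The operator and key names are from my reading of the script and may not match this exact revision; the literal values are the ones from the two commands above.

# Sketch only: how the flags above map onto the Redis-backed Gloo rendezvous.
# Operator/argument names are my reading of resnet50_trainer.py, not verified
# against this exact revision.
from caffe2.python import core, workspace

store_handler = "store_handler"
workspace.RunOperatorOnce(
    core.CreateOperator(
        "RedisStoreHandlerCreate", [], [store_handler],
        host="10.143.119.44",   # --redis_host
        port=5555,              # --redis_port
        prefix="1",             # --run_id, must be identical on both shards
    )
)

rendezvous = dict(
    kv_handler=store_handler,
    shard_id=0,                 # --shard_id: 0 on the first node, 1 on the second
    num_shards=2,               # --num_shards
    engine="GLOO",
    transport="ibverbs",        # --distributed_transport
    interface="mlx5_3",         # --distributed_interfaces
    exit_nets=None,
)
# data_parallel_model.Parallelize_GPU(...) is then called with
# rendezvous=rendezvous, which inserts CreateCommonWorld/Allreduce ops that
# communicate over ibverbs between the two shards.

Both shards have to agree on run_id and num_shards and differ only in shard_id.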
Errors on the first node (shard 0):
INFO:resnet50_trainer:Finished iteration 155/156 of epoch 0 (207.29 images/sec)
INFO:resnet50_trainer:Training loss: 0.000181300914846, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 156/156 of epoch 0 (204.85 images/sec)
INFO:resnet50_trainer:Training loss: 0.00019393475668, accuracy: 1.0
E0404 16:13:32.042721 39912 prefetch_op.h:110] Prefetching error std::bad_alloc
E0404 16:13:32.042976 39911 prefetch_op.h:83] Prefetching failed.
E0404 16:13:32.043117 39911 net_dag.cc:231] Operator chain failed starting at: input: "test_reader" output: "gpu_0/data" output: "gpu_0/label" name: "" type: "ImageInput" arg { name: "std" f: 128 } arg { name: "scale" i: 256 } arg { name: "use_gpu_transform" i: 1 } arg { name: "cudnn_exhaustive_search" i: 1 } arg { name: "crop" i: 256 } arg { name: "is_test" i: 1 } arg { name: "use_cudnn" i: 1 } arg { name: "use_caffe_datum" i: 1 } arg { name: "mirror" i: 1 } arg { name: "output_type" s: "float16" } arg { name: "batch_size" i: 32 } arg { name: "mean" f: 128 } device_option { device_type: 1 cuda_gpu_id: 0 }
E0404 16:13:32.043376 39646 net.h:54] Failed to execute async run
Original python traceback for operator 1886221678 in network resnet50_test in exception above (most recent call last):
Traceback (most recent call last):
File "/root/caffe2/caffe2/python/examples/resnet50_trainer.py", line 615, in
main()
File "/root/caffe2/caffe2/python/examples/resnet50_trainer.py", line 611, in main
Train(args)
File "/root/caffe2/caffe2/python/examples/resnet50_trainer.py", line 529, in Train
explog
File "/root/caffe2/caffe2/python/examples/resnet50_trainer.py", line 201, in RunEpoch
workspace.RunNet(test_model.net.Proto().name)
File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 215, in RunNet
StringifyNetName(name), num_iter, allow_fail,
File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 177, in CallWithExceptionIntercept
return func(*args, **kwargs)
RuntimeError: [enforce fail at pybind_state.cc:1001] success. Error running net resnet50_test
On the second node:
INFO:resnet50_trainer:Training loss: 6.98969364166, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 156/156 of epoch 0 (205.14 images/sec)
INFO:resnet50_trainer:Training loss: 6.99336242676, accuracy: 0.0
INFO:resnet50_trainer:Starting epoch 1/2
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at /root/caffe2/third_party/gloo/gloo/transport/ibverbs/pair.cc:417] wc->status == IBV_WC_SUCCESS. 12 vs 0. Send for slot 2: transport retry counter exceeded
*** Aborted at 1522847623 (unix time) try "date -d @1522847623" if you are using GNU date ***
PC: @ 0x7f84fc58e428 gsignal
*** SIGABRT (@0x9a14) received by PID 39444 (TID 0x7f83ebf8d700) from PID 39444; stack trace: ***
@ 0x7f84fc58e4b0 (unknown)
@ 0x7f84fc58e428 gsignal
@ 0x7f84fc59002a abort
@ 0x7f848079584d __gnu_cxx::__verbose_terminate_handler()
@ 0x7f84807936b6 (unknown)
@ 0x7f8480793701 std::terminate()
@ 0x7f84807bed38 (unknown)
@ 0x7f84fc92a6ba start_thread
@ 0x7f84fc66041d clone
@ 0x0 (unknown)
Aborted (core dumped)