Describe the bug
Running a Horovod TensorFlow job with multiple nodes gives a "Cannot connect to host algo-1" error and hangs indefinitely. The Horovod job runs successfully if I specify a single node and multiple processes. I have also been able to run Horovod multi-node training with the same script outside of SageMaker.
To reproduce
It's a bit complex to add all the scaffolding, but I can put together a reproducible example if necessary. The gist of it is:
from sagemaker.tensorflow import TensorFlow
from sagemaker.inputs import FileSystemInput

role = ...
image_name = "jarednielsen/albert-tf:sagemaker"
fsx_id = ...
subnet_id = ...
security_group_id = ...
hvd_instance_type = "ml.p3.16xlarge"
hvd_processes_per_host = 8
hvd_instance_count = 8
batch_size = 8

distributions = {
    "mpi": {
        "enabled": True,
        "processes_per_host": hvd_processes_per_host,
        # NCCL_DEBUG is an environment variable, so it is passed with -x
        "custom_mpi_options": "-verbose -x NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none",
    }
}

hyperparameters = {
    "model_size": "base",
    "batch_size": batch_size,
    "max_seq_length": 512,
    "gradient_accumulation_steps": 1,
    "learning_rate": 0.00176,
    "optimizer": "lamb",
    "fsx_prefix": "/opt/ml/input/data/training",
    "name": "sagemaker",
}

estimator_hvd = TensorFlow(
    entry_point="/path/to/blank/file.py",
    role=role,
    framework_version="2.1.0",
    py_version="py3",
    hyperparameters=hyperparameters,
    train_instance_count=hvd_instance_count,
    train_instance_type=hvd_instance_type,
    distributions=distributions,
    image_name=image_name,
    subnets=[subnet_id],
    security_group_ids=[security_group_id],
    enable_sagemaker_metrics=True,
)

fsx_input = FileSystemInput(
    file_system_id=fsx_id,
    file_system_type="FSxLustre",
    directory_path="/fsx",
    file_system_access_mode="rw",
)

estimator_hvd.fit(fsx_input)
Expected behavior
For it to not hang :)
Screenshots or logs
$ python run_sagemaker.py
2020-03-20 00:49:47 Starting - Starting the training job...
2020-03-20 00:49:50 Starting - Launching requested ML instances.......................................
2020-03-20 00:56:58 Starting - Preparing the instances for training.........
2020-03-20 00:58:47 Downloading - Downloading input data
2020-03-20 00:58:47 Training - Downloading the training image...............
2020-03-20 01:01:21 Training - Training image download completed. Training in progress.
2020-03-20 01:01:22,880 sagemaker-containers INFO Starting MPI run as worker node.
2020-03-20 01:01:22,880 sagemaker-containers INFO Waiting for MPI Master to create SSH daemon.
2020-03-20 01:01:23,049 sagemaker-containers INFO Starting MPI run as worker node.
2020-03-20 01:01:23,049 sagemaker-containers INFO Waiting for MPI Master to create SSH daemon.
2020-03-20 01:01:22,603 sagemaker-containers INFO Starting MPI run as worker node.
2020-03-20 01:01:22,603 sagemaker-containers INFO Waiting for MPI Master to create SSH daemon.
2020-03-20 01:01:23,981 sagemaker-containers INFO Starting MPI run as worker node.
2020-03-20 01:01:23,982 sagemaker-containers INFO Creating SSH daemon.
2020-03-20 01:01:23,986 sagemaker-containers INFO Waiting for MPI workers to establish their SSH connections
2020-03-20 01:01:23,361 sagemaker-containers INFO Starting MPI run as worker node.
2020-03-20 01:01:23,361 sagemaker-containers INFO Waiting for MPI Master to create SSH daemon.
2020-03-20 01:01:25,144 sagemaker-containers INFO Starting MPI run as worker node.
2020-03-20 01:01:25,144 sagemaker-containers INFO Waiting for MPI Master to create SSH daemon.
2020-03-20 01:01:22,417 sagemaker-containers INFO Starting MPI run as worker node.
2020-03-20 01:01:22,417 sagemaker-containers INFO Waiting for MPI Master to create SSH daemon.
2020-03-20 01:01:24,060 sagemaker-containers INFO Starting MPI run as worker node.
2020-03-20 01:01:24,061 sagemaker-containers INFO Waiting for MPI Master to create SSH daemon.
2020-03-20 01:03:32,823 sagemaker-containers INFO Cannot connect to host algo-1
2020-03-20 01:03:32,824 sagemaker-containers INFO Connection failed with exception:
[Errno 110] Connection timed out
[repeated 64 times]
System information
A description of your system. Please provide:
- SageMaker Python SDK version: 1.50.14
- Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): TensorFlow
- Framework version: 2.1
- Python version: 3.7
- CPU or GPU: GPU
- Custom Docker image (Y/N): Yes
Additional context
The one lead I can think of is that I'm doing something a little different with the entry point script. Instead of specifying it in the TensorFlow() constructor, I specify it in the Dockerfile with
ENV SAGEMAKER_PROGRAM /opt/ml/input/data/training/myscript.py
This is a quirk specific to my situation, but it works fine on a single node. Anything I should dig into to investigate the Horovod hang?
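For comparison, here's a minimal sketch of the conventional wiring, where the SDK uploads the code and launches the entry point itself; train.py and scripts are hypothetical stand-ins for my layout:

from sagemaker.tensorflow import TensorFlow

# Conventional approach: the SDK packages source_dir and runs entry_point
# itself, instead of reading SAGEMAKER_PROGRAM baked into the image.
# "train.py" and "scripts" are hypothetical placeholders for my files.
estimator = TensorFlow(
    entry_point="train.py",
    source_dir="scripts",
    role=role,
    framework_version="2.1.0",
    py_version="py3",
    train_instance_count=hvd_instance_count,
    train_instance_type=hvd_instance_type,
    distributions=distributions,
    image_name=image_name,
)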
I have the following in my Dockerfile, following the lead of https://github.com/aws/sagemaker-tensorflow-container/blob/master/docker/1.15.2/py3/Dockerfile.gpu
# Below here is necessary to install SSH on SageMaker machines
RUN apt-get update && apt-get install -y --no-install-recommends openssh-server && mkdir -p /var/run/sshd
RUN sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
RUN mkdir -p /root/.ssh/ && \
    ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \
    cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys && \
    printf "Host *\n  StrictHostKeyChecking no\n" >> /root/.ssh/config
# Allow OpenSSH to talk to containers without asking for confirmation
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \
    && echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \
    && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config
Anything more I need to do?
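To narrow this down, I may swap in a throwaway entry point that only checks inter-node SSH before Horovod gets involved. A minimal sketch, assuming SageMaker's algo-N hostnames and an 8-instance job (check_ssh.py and the host list are my assumptions, not SDK API):

# check_ssh.py: hypothetical throwaway entry point that verifies each peer
# is reachable over SSH, assuming SageMaker's algo-1..algo-8 hostnames.
import subprocess

hosts = [f"algo-{i}" for i in range(1, 9)]  # assumption: 8 instances

for host in hosts:
    proc = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "hostname"],
        capture_output=True,
        text=True,
    )
    status = "OK" if proc.returncode == 0 else "FAILED: " + proc.stderr.strip()
    print(f"{host}: {status}")

If the workers print FAILED for algo-1 here too, that would point at network reachability (e.g. the security group rules) rather than anything Horovod-specific.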