Horovod multi-node fails to connect and hangs indefinitely #1369

@jarednielsen

Describe the bug
Running a Horovod TensorFlow job on multiple nodes gives a "Cannot connect to host algo-1" error and then hangs indefinitely. The same job runs successfully if I specify a single node with multiple processes. I have been able to run Horovod multi-node training with the same script outside of SageMaker.

To reproduce
It's a bit complex to add all the scaffolding, but I can put together a reproducible example if necessary. The gist of it is:

from sagemaker.tensorflow import TensorFlow
from sagemaker.inputs import FileSystemInput

role = ...
image_name = "jarednielsen/albert-tf:sagemaker"
fsx_id = ...
subnet_id = ...
security_group_id = ...
hvd_instance_type = "ml.p3.16xlarge"
hvd_processes_per_host = 8
hvd_instance_count = 8
batch_size = 8

distributions = {
    "mpi": {
        "enabled": True,
        "processes_per_host": hvd_processes_per_host,
        "custom_mpi_options": "-verbose --NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none",
    }
}

hyperparameters = {
    "model_size": "base",
    "batch_size": batch_size,
    "max_seq_length": 512,
    "gradient_accumulation_steps": 1,
    "learning_rate": 0.00176,
    "optimizer": "lamb",
    "fsx_prefix": "/opt/ml/input/data/training",
    "name": "sagemaker",
}

estimator_hvd = TensorFlow(
    entry_point="/path/to/blank/file.py",
    role=role,
    framework_version="2.1.0",
    py_version="py3",
    hyperparameters=hyperparameters,
    train_instance_count=hvd_instance_count,
    train_instance_type=hvd_instance_type,
    distributions=distributions,
    image_name=image_name,
    subnets=[subnet_id],
    security_group_ids=[security_group_id],
    enable_sagemaker_metrics=True,
)

fsx_input = FileSystemInput(
    file_system_id=fsx_id,
    file_system_type="FSxLustre",
    directory_path="/fsx",
    file_system_access_mode="rw",
)

estimator_hvd.fit(fsx_input)

Expected behavior
The job should not hang :)

Screenshots or logs

$ python run_sagemaker.py
2020-03-20 00:49:47 Starting - Starting the training job...
2020-03-20 00:49:50 Starting - Launching requested ML instances.......................................
2020-03-20 00:56:58 Starting - Preparing the instances for training.........
2020-03-20 00:58:47 Downloading - Downloading input data
2020-03-20 00:58:47 Training - Downloading the training image...............
2020-03-20 01:01:21 Training - Training image download completed. Training in progress.
2020-03-20 01:01:22,880 sagemaker-containers INFO     Starting MPI run as worker node.
2020-03-20 01:01:22,880 sagemaker-containers INFO     Waiting for MPI Master to create SSH daemon.
2020-03-20 01:01:23,049 sagemaker-containers INFO     Starting MPI run as worker node.
2020-03-20 01:01:23,049 sagemaker-containers INFO     Waiting for MPI Master to create SSH daemon.
2020-03-20 01:01:22,603 sagemaker-containers INFO     Starting MPI run as worker node.
2020-03-20 01:01:22,603 sagemaker-containers INFO     Waiting for MPI Master to create SSH daemon.
2020-03-20 01:01:23,981 sagemaker-containers INFO     Starting MPI run as worker node.
2020-03-20 01:01:23,982 sagemaker-containers INFO     Creating SSH daemon.
2020-03-20 01:01:23,986 sagemaker-containers INFO     Waiting for MPI workers to establish their SSH connections
2020-03-20 01:01:23,361 sagemaker-containers INFO     Starting MPI run as worker node.
2020-03-20 01:01:23,361 sagemaker-containers INFO     Waiting for MPI Master to create SSH daemon.
2020-03-20 01:01:25,144 sagemaker-containers INFO     Starting MPI run as worker node.
2020-03-20 01:01:25,144 sagemaker-containers INFO     Waiting for MPI Master to create SSH daemon.
2020-03-20 01:01:22,417 sagemaker-containers INFO     Starting MPI run as worker node.
2020-03-20 01:01:22,417 sagemaker-containers INFO     Waiting for MPI Master to create SSH daemon.
2020-03-20 01:01:24,060 sagemaker-containers INFO     Starting MPI run as worker node.
2020-03-20 01:01:24,061 sagemaker-containers INFO     Waiting for MPI Master to create SSH daemon.
2020-03-20 01:03:32,823 sagemaker-containers INFO     Cannot connect to host algo-1
2020-03-20 01:03:32,824 sagemaker-containers INFO     Connection failed with exception: 
 [Errno 110] Connection timed out
[repeated 64 times]

System information
  • SageMaker Python SDK version: 1.50.14
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): TensorFlow
  • Framework version: 2.1
  • Python version: 3.7
  • CPU or GPU: GPU
  • Custom Docker image (Y/N): Yes

Additional context
The one lead I can think of is that I handle the entrypoint script a little differently. Instead of specifying it in the TensorFlow() constructor, I set it in the Dockerfile with

ENV SAGEMAKER_PROGRAM /opt/ml/input/data/training/myscript.py

This is a quirk specific to my situation, but works fine on single-node. Anything I should dive into to investigate the Horovod hanging issue?
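One thing I could run from inside each container to narrow this down is a plain TCP reachability check against the master host. This is only a rough sketch: the hostname algo-1 comes from the log above, but port 22 (SSH) and the helper name can_connect are my assumptions, not something taken from sagemaker-containers.

```python
import socket


def can_connect(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for debugging; port 22 is an assumption (SSH default).
    """
    try:
        # create_connection resolves the hostname and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        # Covers timeouts, refused connections, and DNS resolution failures.
        print(f"Cannot connect to {host}:{port}: {exc}")
        return False


# Example use from a worker container (hostname taken from the log above):
# can_connect("algo-1")
```

If this times out from the workers but succeeds on a single node, that would point at networking between the instances rather than at the entrypoint quirk.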

I have the following in my Dockerfile, following the lead of https://github.com/aws/sagemaker-tensorflow-container/blob/master/docker/1.15.2/py3/Dockerfile.gpu

# Below here is necessary to install SSH on SageMaker machines
RUN apt-get update && apt-get install -y --no-install-recommends openssh-server && mkdir -p /var/run/sshd
RUN sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
RUN mkdir -p /root/.ssh/ && \
    ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \
    cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys && \
    printf "Host *\n  StrictHostKeyChecking no\n" >> /root/.ssh/config
# Allow OpenSSH to talk to containers without asking for confirmation
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \
    && echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \
    && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config

Anything more I need to do?
