HDF5 parallel test suite on Lustre hangs with OMPI 4.0.1 #6871

@bodgerer

Description

  • Open MPI: 4.0.1
  • Operating system/version: centos7.6
  • Computer hardware: Intel Skylake
  • Network type: Mellanox EDR InfiniBand (although not used in steps to reproduce)

I'm opening this ticket because the HDF5 1.8.21 parallel test suite hangs with OpenMPI 4.0.1 when run on our Lustre 2.12.2 parallel filesystem. The same tests run to completion on an ext4 filesystem.

The following script reproduces both the openmpi/hdf5 build and the issue with the test suite. Can someone help, please?

Thanks,

Mark

#!/bin/bash

# We have run this on a centos7.6 system with an idle Lustre 2.12.2
# filesystem mounted. Same Lustre version on both client and servers.
#
# If we run the script from a location on the Lustre filesystem,
# the hdf5 test "testphdf5" hangs until it is terminated by its 20
# minute alarm. Increasing the alarm timeout makes no difference
# (it's a new, idle filesystem). The last lines that the test
# printed were 6 copies of this:
#
# Testing  -- multi-chunk collective chunk io (cchunk3)
#
# If we run the script from a location on an ext4 filesystem,
# the hdf5 test "testphdf5" completes (although fails hdf5 1.8.21's
# t_pflush1 test, which I believe is a known, separate issue)

set -x
set -e

# (needed on our system to ensure we are using the OS-provided
# version of GCC, etc.)
module purge || true

test -d build || mkdir build
test -d src || mkdir src

prefix=`pwd`/build
export PATH=${prefix}/bin:$PATH

cd src

# openmpi

wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.1.tar.bz2 
tar xf openmpi-4.0.1.tar.bz2
cd openmpi-4.0.1

# (we get the same test behaviour without "--with-io-romio-flags"; however,
# OpenMPI's ROMIO on Lustre is clearly broken without it - trying to use an
# MPI-IO hint to stripe a file didn't work)

./configure --prefix=$prefix \
   --with-io-romio-flags=--with-file-system=lustre+ufs \
   --enable-mpi1-compatibility

make -j12
make install
cd ..

# (ignore infiniband for now if we have it - hdf5 tests only need one host.
# This avoids some warning messages we can ignore)
export OMPI_MCA_btl=^openib

# (disable openmpi's own MPI-IO implementation - recommended by the hdf5
# folks. I believe openmpi defaults to ROMIO on Lustre anyway)
export OMPI_MCA_io=^ompio


# hdf5

wget https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.8/hdf5-1.8.21/src/hdf5-1.8.21.tar.bz2 
tar xf hdf5-1.8.21.tar.bz2
cd hdf5-1.8.21

export CC=mpicc
export CXX=mpicxx
export FC=mpif90
export F77=mpif77
export F90=mpif90

./configure --prefix=$prefix --enable-parallel
make -j12
make check
make install
cd ..
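
Since the behaviour depends on which filesystem the script runs from and on which MPI-IO component is in use, it may help to record both alongside the test log. The following is only a sketch, not part of the original script: the function names `fs_type` and `list_io_components` are made up for illustration, and it assumes GNU coreutils `stat` and that the `ompi_info` built above is on `PATH`.

```shell
#!/bin/bash

# Hypothetical helper: print the type of the filesystem backing a
# directory (defaults to the current directory), via GNU coreutils stat.
fs_type() {
    stat -f -c %T "${1:-.}"
}

# Hypothetical helper: list the MPI-IO ("io") components that this
# Open MPI installation was built with, as reported by ompi_info.
list_io_components() {
    if command -v ompi_info >/dev/null 2>&1; then
        ompi_info | grep "MCA io:"
    else
        echo "ompi_info not found in PATH" >&2
        return 1
    fi
}

# Record the context the tests are about to run in.
echo "working directory filesystem: $(fs_type .)"
list_io_components || true
```

Running this from the same directory as the build script, once on Lustre and once on ext4, would give the two logs a direct point of comparison.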
