mca/coll: segfault on v4.0.x #6876

@amaslenn

Background information

I am getting a segfault in several tests, including MPICH's nbicalltoallw. I bisected the issue and traced it to PR #6806 (c9e4240e70d6e5c1186f7ba2090b8f5bc1c9dc2b is the first bad commit).
v4.0.x before that PR (git reset --hard 507fcc9) works correctly.
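For anyone who wants to repeat the bisect, here is a rough sketch of how it can be automated with git bisect run. The helper name bisect_regression and the run_test.sh script mentioned below are hypothetical and not part of Open MPI; the test command is assumed to exit 0 on a good commit and 1 on a bad one.

```shell
# Sketch: find the first bad commit between a known-good and a known-bad revision.
# Usage: bisect_regression <good> <bad> <test-command> [args...]
# The test command must exit 0 for a good commit and 1..127 (except 125) for a bad one.
bisect_regression() {
    good=$1
    bad=$2
    shift 2
    # "git bisect start <bad> <good>" marks both endpoints in one step.
    git bisect start "$bad" "$good" >/dev/null || return 1
    # Let git drive the search, running the test command at each step.
    git bisect run "$@" >/dev/null
    # When the run finishes, refs/bisect/bad points at the first bad commit.
    first_bad=$(git rev-parse --short refs/bisect/bad)
    # Restore the original HEAD before reporting the result.
    git bisect reset >/dev/null 2>&1
    echo "$first_bad"
}
```

In this case the call would look something like bisect_regression 507fcc9 v4.0.x ./run_test.sh, where run_test.sh rebuilds Open MPI and runs nbicalltoallw. One caveat: git bisect run aborts on exit codes above 127, and a segfaulting job can produce such a code, so the script should map any failure to a plain exit 1.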

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v4.0.x

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

# check out the latest v4.0.x branch
./autogen.pl
./configure --without-ucx --without-hcoll --prefix=$PWD/__install
make -j8
make install -j8

Please describe the system on which you are running

  • Operating system/version: RHEL7.4
  • Computer hardware: x86_64
  • Network type: IB

Details of the problem

Build the MPICH test suite:

wget http://www.mpich.org/static/downloads/3.3.1/mpich-3.3.1.tar.gz
tar xzf mpich-3.3.1.tar.gz
cd mpich-3.3.1
./autogen.sh
cd test/mpi
./configure --with-mpi=/ompi/__install --enable-strictmpi --disable-spawn --disable-cxx

Run the test (I used a Slurm allocation):

./mpirun -np 4 -mca btl self,vader -mca pml ob1 -mca coll ^hcoll --map-by core /mpich-3.3.1/test/mpi/coll/nbicalltoallw

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x31
[ 0] /usr/lib64/libpthread.so.0(+0xf5e0)[0x7fe957b3b5e0]
[ 1] /ompi/__install/lib/libmpi.so.40(ompi_coll_base_retain_datatypes_w+0x74)[0x7fe957de8024]
[ 2] /ompi/__install/lib/libmpi.so.40(PMPI_Ialltoallw+0x2c1)[0x7fe957dac591]
[ 3] /mpich-3.3.1/test/mpi/coll/nbicalltoallw[0x401c28]
[ 4] /mpich-3.3.1/test/mpi/coll/nbicalltoallw[0x401f81]
[ 5] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fe95778ac05]
[ 6] /mpich-3.3.1/test/mpi/coll/nbicalltoallw[0x401af9]
*** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
