Open
Description
Background information
Getting a segfault in several tests, including mpich's nbicalltoallw. I bisected the issue and found the guilty PR: #6806 (c9e4240e70d6e5c1186f7ba2090b8f5bc1c9dc2b is the first bad commit). v4.0.x before it (git reset --hard 507fcc9) works well.
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
v4.0.x
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
# checkout latest v4.0.x branch
./autogen.pl
./configure --without-ucx --without-hcoll --prefix=$PWD/__install
make -j8
make install -j8
Please describe the system on which you are running
- Operating system/version: RHEL7.4
- Computer hardware: x86_64
- Network type: IB
Details of the problem
Build the MPICH test suite:
wget http://www.mpich.org/static/downloads/3.3.1/mpich-3.3.1.tar.gz
tar xzf mpich-3.3.1.tar.gz
cd mpich-3.3.1
./autogen.sh
cd test/mpi
./configure --with-mpi=/ompi/__install --enable-strictmpi --disable-spawn --disable-cxx
Run the test (I used a Slurm allocation):
./mpirun -np 4 -mca btl self,vader -mca pml ob1 -mca coll ^hcoll --map-by core /mpich-3.3.1/test/mpi/coll/nbicalltoallw
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x31
[ 0] /usr/lib64/libpthread.so.0(+0xf5e0)[0x7fe957b3b5e0]
[ 1] /ompi/__install/lib/libmpi.so.40(ompi_coll_base_retain_datatypes_w+0x74)[0x7fe957de8024]
[ 2] /ompi/__install/lib/libmpi.so.40(PMPI_Ialltoallw+0x2c1)[0x7fe957dac591]
[ 3] /mpich-3.3.1/test/mpi/coll/nbicalltoallw[0x401c28]
[ 4] /mpich-3.3.1/test/mpi/coll/nbicalltoallw[0x401f81]
[ 5] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fe95778ac05]
[ 6] /mpich-3.3.1/test/mpi/coll/nbicalltoallw[0x401af9]
*** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
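For anyone who wants to repeat the bisection, here is a rough sketch of a step script for `git bisect run`. It is not the script I actually used: the file name `bisect_step.sh`, the `PREFIX` and `TEST_BIN` variables, and the helper functions are all hypothetical. It assumes the MPICH test binary is already built and that it picks up the freshly installed libmpi at run time. The build and run commands are the ones from this report.

```shell
#!/bin/sh
# Hypothetical step script for `git bisect run` (not part of the original report).
# Assumed usage, from the top of an Open MPI clone:
#   git bisect start v4.0.x 507fcc9        # <bad> <good>
#   git bisect run sh bisect_step.sh run
PREFIX="${PREFIX:-$PWD/__install}"
TEST_BIN="${TEST_BIN:-/mpich-3.3.1/test/mpi/coll/nbicalltoallw}"

# `git bisect run` reads 0 as good, 1..127 (except 125) as bad, 125 as skip,
# and aborts on anything else, so fold every failure status into 1.
bisect_code() {
    if [ "$1" -eq 0 ]; then echo 0; else echo 1; fi
}

bisect_step() {
    # A commit that does not build cannot be judged: ask bisect to skip it.
    ./autogen.pl >/dev/null 2>&1 || return 125
    ./configure --without-ucx --without-hcoll --prefix="$PREFIX" >/dev/null 2>&1 || return 125
    { make -j8 && make install; } >/dev/null 2>&1 || return 125
    # Same command line as in the report; the test binary is assumed to be
    # built already and to pick up the freshly installed libmpi at run time.
    "$PREFIX/bin/mpirun" -np 4 -mca btl self,vader -mca pml ob1 \
        -mca coll '^hcoll' --map-by core "$TEST_BIN"
    return "$(bisect_code "$?")"
}

# Do nothing unless called with "run", so the functions can be sourced and tested.
if [ "${1:-}" = "run" ]; then
    bisect_step
    exit "$?"
fi
```

Folding every failure into exit status 1 matters because a segfaulting run can surface as a status above 127, which `git bisect run` treats as a reason to abort, not as a bad commit.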