
failure during MPI_Win_detach #7384

@naughtont3

Background information

Application failure with Open MPI when using one-sided communication (OSC).

Reporting this on behalf of a user to help track the problem.

The test works fine with MPICH/3.x, Spectrum MPI, and Intel MPI.

What version of Open MPI are you using?

  • Fails with latest master (c6831c5)
  • Need to check status with v4.0.x and v3.0.x

Describe how Open MPI was installed

A standard tarball build can reproduce the problem, using the gcc/8.x compiler suite (gcc-8, g++-8, gfortran-8). gcc 8.x or newer is needed to avoid past gfortran bugs.
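For reference, such a build might look like the following minimal sketch (the install prefix is taken from the backtrace further below; the configure invocation itself is an assumption, not taken from the report):

    # hedged sketch of a standard tarball build with the gcc-8 toolchain;
    # prefix as seen in the backtrace, flags are assumptions
    ./configure CC=gcc-8 CXX=g++-8 FC=gfortran-8 \
                --prefix=/usr/local/mpi/openmpi/git
    make -j 8 && make install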

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: Linux
  • Computer hardware:
  • Network type:

Reproducible on any Linux workstation (gcc 8.x or newer is needed to avoid past gfortran bugs).


Details of the problem

You should be able to reproduce this failure on any Linux workstation. The only
thing you need to make sure of is to use the gcc/8.x compiler suite (gcc-8, g++-8,
gfortran-8), since other versions are buggy on the gfortran side.

    1. git clone https://gitlab.com/DmitryLyakh/ExaTensor.git
    2. git checkout openmpi_fail
    3. export PATH_OPENMPI=PATH_TO_OPENMPI_ROOT_DIR
    4. make
    5. Copy the produced binary Qforce.x into some directory and place both attached scripts there as well
    6. Run run.exatensor.sh (it runs 4 MPI processes, which is the minimal configuration; each process runs up to 8 threads, which is also mandatory, but all of this can be run on a single node; see the launch sketch after this list)
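The attached scripts are not reproduced inline here; a minimal sketch of the kind of launch run.exatensor.sh performs, assuming the per-rank thread count is controlled via OMP_NUM_THREADS (an assumption, not taken from the scripts):

    # hedged sketch of the launch done by run.exatensor.sh (exact contents not shown in the report)
    export OMP_NUM_THREADS=8      # each rank runs up to 8 threads (assumed OpenMP-controlled; libgomp appears in the backtrace)
    mpiexec -n 4 ./Qforce.x       # 4 MPI ranks, the minimal configuration; fits on a single node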

I added a .txt extension to the scripts so they could be attached to the GitHub ticket.

Normally, run.exatensor.sh invokes mpiexec with the binary directly, but for some reason the mpiexec from the latest git master fails to load some dynamic libraries (libgfortran), so I introduced a workaround: run.exatensor.sh invokes mpiexec with exec.sh, which in turn executes the binary Qforce.x. Previous Open MPI versions did not have this issue, by the way, but all of them fail in MPI_Win_detach, as shown in the trace below.
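The wrapper itself is not attached inline here; a minimal sketch of such an exec.sh, assuming its only job is to put the gfortran runtime on the loader path before exec'ing the binary (the library path is an assumption, reusing the gcc install location seen in the backtrace):

    #!/bin/bash
    # hypothetical exec.sh wrapper: make sure libgfortran can be found,
    # then replace this shell with the actual application binary
    export LD_LIBRARY_PATH=/usr/local/gcc/8.2.0/lib64:${LD_LIBRARY_PATH}   # assumed location of the gfortran runtime
    exec ./Qforce.x "$@"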

Destroying tensor dtens ... [exadesktop:32108] *** An error occurred in MPI_Win_detach
[exadesktop:32108] *** reported by process [3156279297,1]
[exadesktop:32108] *** on win rdma window 5
[exadesktop:32108] *** MPI_ERR_OTHER: known error not in list
[exadesktop:32108] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[exadesktop:32108] ***    and potentially your MPI job)
[exadesktop:32108] [0] func:/usr/local/mpi/openmpi/git/lib/libopen-pal.so.0(opal_backtrace_buffer+0x35) [0x149adfa0726f]
[exadesktop:32108] [1] func:/usr/local/mpi/openmpi/git/lib/libmpi.so.0(ompi_mpi_abort+0x9a) [0x149ae0574db1]
[exadesktop:32108] [2] func:/usr/local/mpi/openmpi/git/lib/libmpi.so.0(+0x48d6e) [0x149ae055ad6e]
[exadesktop:32108] [3] func:/usr/local/mpi/openmpi/git/lib/libmpi.so.0(ompi_mpi_errors_are_fatal_win_handler+0xed) [0x149ae055a3d2]
[exadesktop:32108] [4] func:/usr/local/mpi/openmpi/git/lib/libmpi.so.0(ompi_errhandler_invoke+0x155) [0x149ae0559c11]
[exadesktop:32108] [5] func:/usr/local/mpi/openmpi/git/lib/libmpi.so.0(PMPI_Win_detach+0x197) [0x149ae05f2417]
[exadesktop:32108] [6] func:/usr/local/mpi/openmpi/git/lib/libmpi_mpifh.so.0(mpi_win_detach__+0x38) [0x149ae0946d86]
[exadesktop:32108] [7] func:./Qforce.x() [0x564a82]
[exadesktop:32108] [8] func:./Qforce.x() [0x564b42]
[exadesktop:32108] [9] func:./Qforce.x() [0x56df9e]
[exadesktop:32108] [10] func:./Qforce.x() [0x4319fa]
[exadesktop:32108] [11] func:./Qforce.x() [0x42a326]
[exadesktop:32108] [12] func:./Qforce.x() [0x42e2cc]
[exadesktop:32108] [13] func:./Qforce.x() [0x4de039]
[exadesktop:32108] [14] func:/usr/local/gcc/8.2.0/lib64/libgomp.so.1(+0x1743e) [0x149ae841343e]
[exadesktop:32108] [15] func:/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x149ae1d416db]
[exadesktop:32108] [16] func:/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x149adfdea88f]
[exadesktop:32103] PMIX ERROR: UNREACHABLE in file ../../../../../../../opal/mca/pmix/pmix4x/openpmix/src/server/pmix_server.c at line 2188
[exadesktop:32103] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[exadesktop:32103] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
