Repeated calls to MPI_Cart_create segfault or hang when the first argument is a cart comm #6522

@wrs20

Fails with Open MPI 4.0.0 and 3.1.3; passes with Open MPI 2.1.1 and with MPICH.
Open MPI 4.0.0 was installed from the source tarball. OS: Ubuntu 18.04.2.

Source:

// cart_break.c
#include <mpi.h>
int main(int argc, char * argv[]){
   
    MPI_Init(&argc, &argv);
    MPI_Comm parent0, parent1, child0, child1;

    int ndims2[3] = {2, 1, 1};
    int ndims1[3] = {1, 1, 1};
    int periods[3] = {1, 1, 1};
    
    // parent 0 (works)
    if (MPI_Cart_create(MPI_COMM_WORLD, 3, ndims2, periods, 0, &parent0) != MPI_SUCCESS) { return -1; }

    // child0 from parent0 (works)
    if(MPI_Cart_create(parent0, 3, ndims1, periods, 0, &child0) != MPI_SUCCESS) { return -1; }


    // parent 1 (works)
    if(MPI_Cart_create(MPI_COMM_WORLD, 3, ndims2, periods, 0, &parent1) != MPI_SUCCESS){ return -1; }

    // child1 from parent1: segfaults in C, hangs in mpi4py.
    // Passes if parent1 is replaced with either parent0 or MPI_COMM_WORLD.
    if(MPI_Cart_create(parent1, 3, ndims1, periods, 0, &child1) != MPI_SUCCESS) {return -1;}
    

    // cleanup
    if (child0 != MPI_COMM_NULL){ MPI_Comm_free(&child0); }
    if (child1 != MPI_COMM_NULL){ MPI_Comm_free(&child1); }
    if (parent0 != MPI_COMM_NULL){ MPI_Comm_free(&parent0); }
    if (parent1 != MPI_COMM_NULL){ MPI_Comm_free(&parent1); }


    MPI_Finalize();
    return 0;
}

To reproduce:

shell$ mpicc cart_break.c
shell$ mpirun -n 2 ./a.out

Traceback:

[pypc-dm-07:10294] *** Process received signal ***
[pypc-dm-07:10294] Signal: Segmentation fault (11)
[pypc-dm-07:10294] Signal code: Address not mapped (1)
[pypc-dm-07:10294] Failing at address: 0x8
[pypc-dm-07:10294] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f34b01f6f20]
[pypc-dm-07:10294] [ 1] /home/wrs20/opt/openmpi-4.0.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_match+0x995)[0x7f349f3cf1f5]
[pypc-dm-07:10294] [ 2] /home/wrs20/opt/openmpi-4.0.0/lib/openmpi/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x8f)[0x7f349eb0439f]
[pypc-dm-07:10294] [ 3] /home/wrs20/opt/openmpi-4.0.0/lib/openmpi/mca_btl_vader.so(+0x46c7)[0x7f349eb046c7]
[pypc-dm-07:10294] [ 4] /home/wrs20/opt/openmpi-4.0.0/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f34afc1cc2c]
[pypc-dm-07:10294] [ 5] /home/wrs20/opt/openmpi-4.0.0/lib/libmpi.so.40(ompi_comm_nextcid+0x105)[0x7f34b05deb25]
[pypc-dm-07:10294] [ 6] /home/wrs20/opt/openmpi-4.0.0/lib/libmpi.so.40(ompi_comm_enable+0x39)[0x7f34b05dc0f9]
[pypc-dm-07:10294] [ 7] /home/wrs20/opt/openmpi-4.0.0/lib/libmpi.so.40(mca_topo_base_cart_create+0x1c4)[0x7f34b0683a54]
[pypc-dm-07:10294] [ 8] /home/wrs20/opt/openmpi-4.0.0/lib/libmpi.so.40(MPI_Cart_create+0x25f)[0x7f34b061299f]
[pypc-dm-07:10294] [ 9] a.out(+0xa68)[0x56198c604a68]
[pypc-dm-07:10294] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f34b01d9b97]
[pypc-dm-07:10294] [11] a.out(+0x84a)[0x56198c60484a]
[pypc-dm-07:10294] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

Many Thanks,
Will
