Hang in bcast_intra_generic #3667

@rhc54

This occurs on master and v3.x, but only in one very specific use case (which is why testing isn't catching it). The problem appears when we execute a dynamic operation (spawn, connect/accept, etc.) across two nodes and the number of participating procs is identical on each node.

After the child procs successfully start up, everything hangs in bcast_intra_generic:

#0  0x00007f1e8bb46dfd in poll () from /usr/lib64/libc.so.6
#1  0x00007f1e8ce0dff8 in poll_dispatch (base=0xc13dd0, tv=<optimized out>) at poll.c:165
#2  0x00007f1e8ce04ab6 in opal_libevent2022_event_base_loop (base=0xc13dd0, flags=2) at event.c:1630
#3  0x00007f1e8cda0799 in opal_progress () at runtime/opal_progress.c:204
#4  0x00007f1e8d4191f1 in ompi_request_wait_completion (req=0xd08600) at ../ompi/request/request.h:392
#5  0x00007f1e8d41922f in ompi_request_default_wait (req_ptr=0x7ffd57a909c0, status=0x0) at request/req_wait.c:42
#6  0x00007f1e8d4b214a in ompi_coll_base_bcast_intra_generic (buffer=0x7ffd57a90f48, original_count=1, datatype=0x6021c0 <ompi_mpi_int>, root=0, comm=0x6025c0 <ompi_mpi_comm_world>, module=0xd09ad0, 
    count_by_segment=1, tree=0xd0abe0) at base/coll_base_bcast.c:209
#7  0x00007f1e8d4b282e in ompi_coll_base_bcast_intra_binomial (buffer=0x7ffd57a90f48, count=1, datatype=0x6021c0 <ompi_mpi_int>, root=0, comm=0x6025c0 <ompi_mpi_comm_world>, module=0xd09ad0, segsize=0)
    at base/coll_base_bcast.c:335
#8  0x00007f1e7b9e3e82 in ompi_coll_tuned_bcast_intra_dec_fixed (buff=0x7ffd57a90f48, count=1, datatype=0x6021c0 <ompi_mpi_int>, root=0, comm=0x6025c0 <ompi_mpi_comm_world>, module=0xd09ad0)
    at coll_tuned_decision_fixed.c:258
#9  0x00007f1e8d3f9965 in ompi_dpm_connect_accept (comm=0x6025c0 <ompi_mpi_comm_world>, root=0, port_string=0xd08ab0 "2558984193.0:2441757066", send_first=true, newcomm=0x7ffd57a911d0) at dpm/dpm.c:239
#10 0x00007f1e8d3fe9d3 in ompi_dpm_dyn_init () at dpm/dpm.c:1004
#11 0x00007f1e8d41c49a in ompi_mpi_init (argc=0, argv=0x0, requested=0, provided=0x7ffd57a9143c) at runtime/ompi_mpi_init.c:949
#12 0x00007f1e8d45a147 in PMPI_Init (argc=0x0, argv=0x0) at pinit.c:68
#13 0x0000000000400d23 in main (argc=1, argv=0x7ffd57a915c8) at simple_spawn.c:19
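
For anyone reading the trace without the collective code at hand: frame #7 shows the broadcast going through the binomial algorithm, where each non-root rank typically receives from a single parent before forwarding to its own children. Below is a minimal sketch of the textbook binomial-tree layout. The names `binomial_parent` and `binomial_children` are hypothetical, and this is not Open MPI's tree-building code, which may order ranks differently.

```c
#include <assert.h>

/* Textbook binomial-tree parent for a broadcast rooted at `root`.
 * Returns -1 for the root itself. Illustrative only; not taken
 * from Open MPI. */
int binomial_parent(int rank, int root, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    assert(root >= 0 && root < size);

    int vrank = (rank - root + size) % size;  /* rank relative to root */
    if (vrank == 0) {
        return -1;                            /* root has no parent */
    }
    int parent_v = vrank & (vrank - 1);       /* clear lowest set bit */
    return (parent_v + root) % size;
}

/* Writes up to `max` child ranks into children[] and returns how
 * many were written. */
int binomial_children(int rank, int root, int size, int *children, int max)
{
    assert(size > 0 && rank >= 0 && rank < size);
    assert(root >= 0 && root < size);

    int vrank = (rank - root + size) % size;
    int n = 0;
    for (int mask = 1; mask < size; mask <<= 1) {
        if (vrank & mask) {
            break;  /* reached this rank's lowest set bit: no more children */
        }
        int child_v = vrank + mask;
        if (child_v < size && n < max) {
            children[n++] = (child_v + root) % size;
        }
    }
    return n;
}
```

In this layout, with root 0 and four ranks, rank 3 would receive from rank 2 and rank 2 from rank 0. A rank sitting in ompi_request_wait_completion inside the broadcast could be waiting on such a parent, though the trace alone does not confirm which side is stuck.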

The full backtrace of all threads:

(gdb) thread apply all bt

Thread 3 (Thread 0x7f1e8a184700 (LWP 207206)):
#0  0x00007f1e8bb46dfd in poll () from /usr/lib64/libc.so.6
#1  0x00007f1e8ce0dff8 in poll_dispatch (base=0xc2cb40, tv=<optimized out>) at poll.c:165
#2  0x00007f1e8ce04ab6 in opal_libevent2022_event_base_loop (base=0xc2cb40, flags=1) at event.c:1630
#3  0x00007f1e8cda79e9 in progress_engine (obj=0xc172d8) at runtime/opal_progress_threads.c:105
#4  0x00007f1e8be22dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007f1e8bb5173d in clone () from /usr/lib64/libc.so.6

Thread 2 (Thread 0x7f1e83fff700 (LWP 207207)):
#0  0x00007f1e8bb51d13 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007f1e8ce017f8 in epoll_dispatch (base=0xc5ae10, tv=<optimized out>) at epoll.c:407
#2  0x00007f1e8ce04ab6 in opal_libevent2022_event_base_loop (base=0xc5ae10, flags=1) at event.c:1630
#3  0x00007f1e892ff591 in progress_engine (obj=0xc5ad68) at runtime/pmix_progress_threads.c:110
#4  0x00007f1e8be22dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007f1e8bb5173d in clone () from /usr/lib64/libc.so.6

Thread 1 (Thread 0x7f1e8d96b740 (LWP 207205)):
#0  0x00007f1e8bb46dfd in poll () from /usr/lib64/libc.so.6
#1  0x00007f1e8ce0dff8 in poll_dispatch (base=0xc13dd0, tv=<optimized out>) at poll.c:165
#2  0x00007f1e8ce04ab6 in opal_libevent2022_event_base_loop (base=0xc13dd0, flags=2) at event.c:1630
#3  0x00007f1e8cda0799 in opal_progress () at runtime/opal_progress.c:204
#4  0x00007f1e8d4191f1 in ompi_request_wait_completion (req=0xd08600) at ../ompi/request/request.h:392
#5  0x00007f1e8d41922f in ompi_request_default_wait (req_ptr=0x7ffd57a909c0, status=0x0) at request/req_wait.c:42
#6  0x00007f1e8d4b214a in ompi_coll_base_bcast_intra_generic (buffer=0x7ffd57a90f48, original_count=1, datatype=0x6021c0 <ompi_mpi_int>, root=0, comm=0x6025c0 <ompi_mpi_comm_world>, module=0xd09ad0, 
    count_by_segment=1, tree=0xd0abe0) at base/coll_base_bcast.c:209
#7  0x00007f1e8d4b282e in ompi_coll_base_bcast_intra_binomial (buffer=0x7ffd57a90f48, count=1, datatype=0x6021c0 <ompi_mpi_int>, root=0, comm=0x6025c0 <ompi_mpi_comm_world>, module=0xd09ad0, segsize=0)
    at base/coll_base_bcast.c:335
#8  0x00007f1e7b9e3e82 in ompi_coll_tuned_bcast_intra_dec_fixed (buff=0x7ffd57a90f48, count=1, datatype=0x6021c0 <ompi_mpi_int>, root=0, comm=0x6025c0 <ompi_mpi_comm_world>, module=0xd09ad0)
    at coll_tuned_decision_fixed.c:258
#9  0x00007f1e8d3f9965 in ompi_dpm_connect_accept (comm=0x6025c0 <ompi_mpi_comm_world>, root=0, port_string=0xd08ab0 "2558984193.0:2441757066", send_first=true, newcomm=0x7ffd57a911d0) at dpm/dpm.c:239
#10 0x00007f1e8d3fe9d3 in ompi_dpm_dyn_init () at dpm/dpm.c:1004
#11 0x00007f1e8d41c49a in ompi_mpi_init (argc=0, argv=0x0, requested=0, provided=0x7ffd57a9143c) at runtime/ompi_mpi_init.c:949
#12 0x00007f1e8d45a147 in PMPI_Init (argc=0x0, argv=0x0) at pinit.c:68
#13 0x0000000000400d23 in main (argc=1, argv=0x7ffd57a915c8) at simple_spawn.c:19

Anyone have thoughts?
