Background information
shell$ mpirun -V
mpirun (Open MPI) 4.0.4
shell$ ompi_info | grep -i ucx
Configure command line: '--prefix=/project/dsi/apps/easybuild/software/OpenMPI/4.0.4-iccifort-2019.5.281' '--build=x86_64-pc-linux-gnu' '--host=x86_64-pc-linux-gnu' '--enable-mpirun-prefix-by-default' '--enable-shared' '--with-verbs' '--with-hwloc=/project/dsi/apps/easybuild/software/hwloc/2.2.0-GCCcore-8.3.0' '--with-ucx=/project/dsi/apps/easybuild/software/UCX/1.8.0-GCCcore-8.3.0'
MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.0.4)
MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.0.4)
shell$ uname -or
3.10.0-1160.11.1.el7.x86_64 GNU/Linux
shell$ srun -V
slurm 20.02.5
shell$ ucx_info -u t -e
#
# UCP endpoint
#
# peer: <no debug data>
# lane[0]: 2:self/memory md[2] -> md[2]/self am am_bw#0
# lane[1]: 8:rc_mlx5/mlx5_0:1 md[5] -> md[5]/ib rma_bw#0 wireup{ud_mlx5/mlx5_0:1}
# lane[2]: 13:cma/memory md[7] -> md[7]/cma rma_bw#1
#
# tag_send: 0..<egr/short>..8185..<egr/bcopy>..8192..<rndv>..(inf)
# tag_send_nbr: 0..<egr/short>..8185..<egr/bcopy>..262144..<rndv>..(inf)
# tag_send_sync: 0..<egr/short>..8185..<egr/bcopy>..8192..<rndv>..(inf)
#
# rma_bw: mds [5] rndv_rkey_size 18
#
Details of the problem
C code: test_mpi.c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NUM_SPAWNS 2

int
main(int argc, char** argv)
{
    int errcodes[NUM_SPAWNS];
    MPI_Comm parentcomm, intercomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parentcomm);

    if (parentcomm == MPI_COMM_NULL) {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        // problem here
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, NUM_SPAWNS, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &intercomm, errcodes);
        printf("Parent %d\n", rank);
        MPI_Bcast(&rank, 1, MPI_INT, MPI_ROOT, intercomm);
    } else {
        int rank, parent_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Bcast(&parent_rank, 1, MPI_INT, 0, parentcomm);
        printf("Child %d of parent %d\n", rank, parent_rank);
    }

    fflush(stdout);
    MPI_Finalize();
    return 0;
}
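When debugging this kind of intermittent spawn failure, a non-aborting variant of the spawn call can be useful. The sketch below is my own addition, not part of the reproducer, and the helper name spawn_with_diagnostics is hypothetical: it switches MPI_COMM_SELF to MPI_ERRORS_RETURN so that MPI_Comm_spawn returns an error code instead of triggering the default MPI_ERRORS_ARE_FATAL abort, and then prints the translated error string together with the per-child entries of errcodes[].

/* Sketch only (hypothetical helper): report spawn failures instead of aborting. */
static void spawn_with_diagnostics(char* cmd, MPI_Comm* intercomm)
{
    int errcodes[NUM_SPAWNS];
    char msg[MPI_MAX_ERROR_STRING];
    int rc, len, i;

    /* Let MPI_Comm_spawn return an error code instead of aborting the job. */
    MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);
    rc = MPI_Comm_spawn(cmd, MPI_ARGV_NULL, NUM_SPAWNS, MPI_INFO_NULL,
                        0, MPI_COMM_SELF, intercomm, errcodes);
    if (rc != MPI_SUCCESS) {
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "MPI_Comm_spawn failed: %s\n", msg);
        for (i = 0; i < NUM_SPAWNS; i++)
            fprintf(stderr, "  errcodes[%d] = %d\n", i, errcodes[i]);
    }
}

With the default MPI_ERRORS_ARE_FATAL handler, as in the reproducer above, a failing spawn aborts the whole job, which matches the MPI_ERRORS_ARE_FATAL abort in the second run below.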
Compile and run
shell$ mpicc test_mpi.c -o test_mpi
shell$ srun -N 2 --ntasks-per-node 3 --pty /bin/bash -l
shell$ mpirun --map-by ppr:1:node --bind-to core --mca btl '^openib,uct' --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 --report-bindings ./test_mpi
Sometimes I get
[compute-5-1.local:45181] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/./.][]
[compute-5-2.local:41401] MCW rank 1 bound to socket 1[core 0[hwt 0]]: [][B/./.]
[compute-5-1.local:45181] MCW rank 0 bound to socket 0[core 2[hwt 0]]: [././B][]
[compute-5-1.local:45181] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/.][]
[compute-5-2.local:41401] MCW rank 0 bound to socket 1[core 1[hwt 0]]: [][./B/.]
[compute-5-2.local:41401] MCW rank 1 bound to socket 1[core 2[hwt 0]]: [][././B]
[1611013241.802944] [compute-5-1:45191:0] wireup.c:315 UCX ERROR ep 0x2ac27172a048: no remote ep address for lane[1]->remote_lane[1]
Parent 0
[1611013241.803457] [compute-5-2:41405:0] wireup.c:315 UCX ERROR ep 0x2b070d728090: no remote ep address for lane[1]->remote_lane[1]
Child 0 of parent 0
Child 1 of parent 0
Parent 1
Child 0 of parent 1
Child 1 of parent 1
[1611013241.806914] [compute-5-2:41405:0] wireup.c:315 UCX ERROR ep 0x2b070d728048: no remote ep address for lane[1]->remote_lane[1]
[1611013241.819399] [compute-5-1:45191:0] wireup.c:315 UCX ERROR ep 0x2ac27172a090: no remote ep address for lane[1]->remote_lane[1]
which, apart from the four UCX ERROR lines, is the expected output. (Note that lane[1] in those errors corresponds to the rc_mlx5/mlx5_0:1 InfiniBand lane shown in the ucx_info output above.) But sometimes I get
shell$ mpirun --map-by ppr:1:node --bind-to core --mca btl '^openib,uct' --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 --report-bindings ./test_mpi
[compute-5-1.local:44689] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/./.][]
[compute-5-2.local:40943] MCW rank 1 bound to socket 1[core 0[hwt 0]]: [][B/./.]
[compute-5-1.local:44689] MCW rank 0 bound to socket 0[core 2[hwt 0]]: [././B][]
[compute-5-1.local:44689] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/.][]
[compute-5-2.local:40943] MCW rank 0 bound to socket 1[core 1[hwt 0]]: [][./B/.]
[compute-5-2.local:40943] MCW rank 1 bound to socket 1[core 2[hwt 0]]: [][././B]
[compute-5-1.local:44699] pml_ucx.c:176 Error: Failed to receive UCX worker address: Not found (-13)
[compute-5-1.local:44699] [[57305,1],0] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
[compute-5-1:44699] *** An error occurred in MPI_Comm_spawn
[compute-5-1:44699] *** reported by process [3755540481,0]
[compute-5-1:44699] *** on communicator MPI_COMM_SELF
[compute-5-1:44699] *** MPI_ERR_OTHER: known error not in list
[compute-5-1:44699] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[compute-5-1:44699] *** and potentially your MPI job)
[compute-5-2.local:40947] pml_ucx.c:176 Error: Failed to receive UCX worker address: Not found (-13)
[compute-5-2.local:40947] [[57305,1],1] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
[compute-5-1.local:44689] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[compute-5-1.local:44689] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[compute-5-1.local:44689] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
I get the same result even without specifying --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1.
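A cross-check I have not yet tried (so this is only a suggestion, not a verified result): as far as I know, Open MPI 4.0.x still selects the UCX PML automatically when UCX is available, even without --mca pml ucx, so forcing the ob1 PML explicitly should show whether the failure is UCX-specific:
shell$ mpirun --map-by ppr:1:node --bind-to core --mca pml ob1 --mca btl tcp,self,vader --report-bindings ./test_mpi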