Closed
Description
With the master branch of OpenMPI (ba46b35), I have a problem when I use a rankfile:
shell$ mpirun -np 2 -machinefile mf -rf rf ./test
[miriel044:01383] pmix_mca_base_component_repository_open: unable to open mca_pnet_opa: libpsm2.so.2: cannot open shared object file: No such file or directory (ignored)
[miriel045:135838] pmix_mca_base_component_repository_open: unable to open mca_pnet_opa: libpsm2.so.2: cannot open shared object file: No such file or directory (ignored)
[miriel044:01383] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././.][./././././././././././.]
[miriel045:135838] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/././././././././././.][./././././././././././.]
[miriel044:01389] pmix_mca_base_component_repository_open: unable to open mca_pnet_opa: libpsm2.so.2: cannot open shared object file: No such file or directory (ignored)
[miriel045:135890] pmix_mca_base_component_repository_open: unable to open mca_pnet_opa: libpsm2.so.2: cannot open shared object file: No such file or directory (ignored)
[miriel045:135890] PMIX ERROR: ERROR STRING NOT FOUND in file ../../../../../../../../../../src/opal/mca/pmix/pmix2x/pmix/src/mca/ptl/tcp/ptl_tcp.c at line 299
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[miriel045:135890] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
[miriel044:01389] mca_base_component_repository_open: unable to open mca_mtl_psm2: libpsm2.so.2: cannot open shared object file: No such file or directory (ignored)
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[21140,1],1]
Exit code: 1
where mf is
miriel044 slots=24
miriel045 slots=24
and rf is
rank 0=miriel044 slot=0
rank 1=miriel045 slot=0
If I use hostname instead of ./test, it works.
It has not worked since commit 48fc339, although the error has changed: at first it was a "not enough slots available" problem.
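To rule out a simple mismatch between the two files, a small cross-check can be sketched as below. This is only an illustration: the function names are hypothetical, and the parsing assumes nothing beyond the simple `host slots=N` and `rank N=host slot=S` line forms shown above. It does not cover the full hostfile or rankfile syntax that mpirun accepts.

```python
import re

# Assumption: rankfile lines look like "rank 0=miriel044 slot=0".
RANK_LINE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\S+)$")


def parse_machinefile(text):
    """Return {hostname: slots} from lines like 'miriel044 slots=24'.

    slots is None when the line carries no 'slots=' field.
    """
    hosts = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        slots = None
        for field in fields[1:]:
            if field.startswith("slots="):
                slots = int(field.split("=", 1)[1])
        hosts[fields[0]] = slots
    return hosts


def parse_rankfile(text):
    """Return {rank: (hostname, slot_spec)}; slot_spec is kept as a string."""
    ranks = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        match = RANK_LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognized rankfile line: {line!r}")
        ranks[int(match.group(1))] = (match.group(2), match.group(3))
    return ranks


def unknown_hosts(machinefile_text, rankfile_text):
    """Return the sorted ranks whose host does not appear in the machinefile."""
    hosts = parse_machinefile(machinefile_text)
    ranks = parse_rankfile(rankfile_text)
    return sorted(rank for rank, (host, _) in ranks.items() if host not in hosts)
```

Passing the contents of mf and rf to `unknown_hosts` returns an empty list when every host named in the rankfile also appears in the machinefile, as is the case for the two files in this report.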