Cannot use rankfile option #3657

@cbordage

With the master branch of Open MPI (ba46b35), I have a problem when I use a rankfile.

shell$ mpirun -np 2 -machinefile mf -rf rf ./test
[miriel044:01383] pmix_mca_base_component_repository_open: unable to open mca_pnet_opa: libpsm2.so.2: cannot open shared object file: No such file or directory (ignored)
[miriel045:135838] pmix_mca_base_component_repository_open: unable to open mca_pnet_opa: libpsm2.so.2: cannot open shared object file: No such file or directory (ignored)
[miriel044:01383] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././.][./././././././././././.]
[miriel045:135838] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/././././././././././.][./././././././././././.]
[miriel044:01389] pmix_mca_base_component_repository_open: unable to open mca_pnet_opa: libpsm2.so.2: cannot open shared object file: No such file or directory (ignored)
[miriel045:135890] pmix_mca_base_component_repository_open: unable to open mca_pnet_opa: libpsm2.so.2: cannot open shared object file: No such file or directory (ignored)
[miriel045:135890] PMIX ERROR: ERROR STRING NOT FOUND in file ../../../../../../../../../../src/opal/mca/pmix/pmix2x/pmix/src/mca/ptl/tcp/ptl_tcp.c at line 299
*** An error occurred in MPI_Init                      
*** on a NULL communicator 
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,                                       
***    and potentially your MPI job)                   
[miriel045:135890] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------                                                        
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------                                                        
[miriel044:01389] mca_base_component_repository_open: unable to open mca_mtl_psm2: libpsm2.so.2: cannot open shared object file: No such file or directory (ignored)
--------------------------------------------------------------------------                                     
mpirun detected that one or more processes exited with non-zero status, thus causing                           
the job to be terminated. The first process to do so was:

  Process name: [[21140,1],1]                          
  Exit code:    1          

where mf is

miriel044 slots=24
miriel045 slots=24

and rf is

rank 0=miriel044 slot=0
rank 1=miriel045 slot=0
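To reproduce this on other nodes, a rankfile of the same shape can be derived from the machinefile. The following is only a minimal sketch, assuming one rank per host and the same slot index on every host, as in the rf above. The function name `build_rankfile` is hypothetical and is not part of Open MPI.

```python
def build_rankfile(machinefile_text: str, slot: int = 0) -> str:
    """Return rankfile text with one rank per host listed in the machinefile.

    Assumption: every rank is pinned to the same slot index (0 by default),
    which matches the rf shown above. Only the host name (the first token on
    each line) is used; any "slots=N" field is ignored.
    """
    lines = []
    rank = 0
    for raw in machinefile_text.splitlines():
        # Drop comments and surrounding whitespace, skip empty lines.
        entry = raw.split("#", 1)[0].strip()
        if not entry:
            continue
        host = entry.split()[0]
        lines.append(f"rank {rank}={host} slot={slot}\n")
        rank += 1
    return "".join(lines)


if __name__ == "__main__":
    mf = "miriel044 slots=24\nmiriel045 slots=24\n"
    print(build_rankfile(mf), end="")
```

Writing the returned text to a file named rf gives the rankfile passed with -rf in the command above.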

If I use hostname instead of ./test, it works.

It has not worked since commit 48fc339, although the error has changed: at first, it was a "not enough slots available" problem.
