
Can we allow for site-specific tuning of OpenMPI? #456

@ocaisa

I ran into an issue where a simple mpi4py code would not run on a Magic Castle deployment with EESSI (though it works on the same system with the pilot):

[ocaisa@node1 ~]$ cat bcast.py 
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.rank

if rank == 0:
    data = {'a':1,'b':2,'c':3}
else:
    data = None

data = comm.bcast(data, root=0)
print('rank %d : %s'% (rank,data))

[ocaisa@login1 ~]$ module purge
[ocaisa@login1 ~]$ echo $MODULEPATH
/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/modules/all
[ocaisa@login1 ~]$ module use /cvmfs/pilot.eessi-hpc.org/versions/2021.12/software/linux/x86_64/amd/zen3/modules/all
[ocaisa@login1 ~]$ module load SciPy-bundle/2021.05-foss-2021a
[ocaisa@login1 ~]$ mpirun -n 2 python bcast.py 
rank 0 : {'a': 1, 'b': 2, 'c': 3}
rank 1 : {'a': 1, 'b': 2, 'c': 3}

[ocaisa@login1 ~]$ module purge
[ocaisa@login1 ~]$ module unuse /cvmfs/pilot.eessi-hpc.org/versions/2021.12/software/linux/x86_64/amd/zen3/modules/all
[ocaisa@login1 ~]$ module load mpi4py
[ocaisa@login1 ~]$ module list

Currently Loaded Modules:
  1) GCCcore/12.3.0                  5) libpciaccess/0.17-GCCcore-12.3.0   9) UCX/1.14.1-GCCcore-12.3.0        13) OpenMPI/4.1.5-GCC-12.3.0      17) libffi/3.4.4-GCCcore-12.3.0
  2) GCC/12.3.0                      6) hwloc/2.9.1-GCCcore-12.3.0        10) libfabric/1.18.0-GCCcore-12.3.0  14) gompi/2023a                   18) Python/3.11.3-GCCcore-12.3.0
  3) numactl/2.0.16-GCCcore-12.3.0   7) OpenSSL/1.1                       11) PMIx/4.2.4-GCCcore-12.3.0        15) Tcl/8.6.13-GCCcore-12.3.0     19) mpi4py/3.1.4-gompi-2023a
  4) libxml2/2.11.4-GCCcore-12.3.0   8) libevent/2.1.12-GCCcore-12.3.0    12) UCC/1.2.0-GCCcore-12.3.0         16) SQLite/3.42.0-GCCcore-12.3.0

[ocaisa@login1 ~]$ mpirun -n 2 python bcast.py 
login1.int.jetstream2.hpc-carpentry.org:rank0.python: Failed to get eth0 (unit 1) cpu set
login1.int.jetstream2.hpc-carpentry.org:rank0: PSM3 can't open nic unit: 1 (err=23)
login1.int.jetstream2.hpc-carpentry.org:rank1.python: Failed to get eth0 (unit 1) cpu set
login1.int.jetstream2.hpc-carpentry.org:rank1: PSM3 can't open nic unit: 1 (err=23)
login1.int.jetstream2.hpc-carpentry.org:rank1.python: Failed to get eth0 (unit 1) cpu set
login1.int.jetstream2.hpc-carpentry.org:rank1: PSM3 can't open nic unit: 1 (err=23)
login1.int.jetstream2.hpc-carpentry.org:rank0.python: Failed to get eth0 (unit 1) cpu set
login1.int.jetstream2.hpc-carpentry.org:rank0: PSM3 can't open nic unit: 1 (err=23)
(hanging)

It turns out this issue was already "solved" for an EasyBuild use case, and that fix also resolved things in my case.

It does raise the issue, though, that OpenMPI may need to be configured to work correctly on the host site (and indeed this was also raised in #1). @bartoldeman explained how they account for this in Compute Canada:

the way we solve this (for the soft.computecanada.ca stack) is to set an environment variable RSNT_INTERCONNECT using this logic in lmod:

function get_interconnect()
        local posix = require "posix"
        if posix.stat("/sys/module/opa_vnic","type") == 'directory' then
                return "omnipath"
        elseif posix.stat("/sys/module/ib_core","type") == 'directory' then
                return "infiniband"
        end
        return "ethernet"
end
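
For illustration, here is a minimal sketch of how this function might be wired up so that RSNT_INTERCONNECT actually gets exported. The surrounding context (e.g. a site-wide Lmod modulefile or SitePackage.lua hook where setenv() is available) is an assumption on my part, not necessarily Compute Canada's actual setup:

-- Hypothetical wiring: assumes get_interconnect() from above is defined in the
-- same file, and that this runs in an Lmod context where setenv() is available.
if os.getenv("RSNT_INTERCONNECT") == nil then
        setenv("RSNT_INTERCONNECT", get_interconnect())
end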

for "ethernet" we have:

OMPI_MCA_btl='^openib,ofi'
OMPI_MCA_mtl='^ofi'
OMPI_MCA_osc='^ucx'
OMPI_MCA_pml='^ucx'

so libfabric (OFI) isn't used by Open MPI, which eliminates any use of PSM3 as well; it basically forces Open MPI to use the tcp or vader (shm) btl plus the self btl, with the ob1 pml, and no runtime use of UCX or OFI.
I'm not sure if EESSI still compiles Open MPI with support for openib; if not, the first setting could simply be OMPI_MCA_btl='^ofi'
for "infiniband" it's:

OMPI_MCA_btl='^openib,ofi'
OMPI_MCA_mtl='^ofi'

to eliminate libfabric as well; Open MPI will use UCX through its priority mechanism.
Lastly, for "omnipath":

OMPI_MCA_btl='^openib'
OMPI_MCA_osc='^ucx'
OMPI_MCA_pml='^ucx'

where we do allow OFI, though the priority mechanism will select the cm pml with the psm2 mtl.
So basically:

  • always exclude openib (the only use case we have for it is DDT, that's why it's compiled in)
  • infiniband excludes libfabric
  • omnipath excludes UCX
  • ethernet excludes both libfabric and UCX

We set the envvars via a configuration file included in the module, specifically with a modluafooter in the easyconfig:

assert(loadfile("/cvmfs/soft.computecanada.ca/config/lmod/openmpi_custom.lua"))("4.1")
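
For context, here is a minimal, purely hypothetical sketch of what such a loaded config file could contain, combining the interconnect detection above with the per-interconnect exclusions. This is not the actual contents of openmpi_custom.lua; note that loadfile() compiles the file into a chunk, and calling that chunk with "4.1" passes the Open MPI version in via the vararg expression `...`:

-- Hypothetical sketch of a site config file like openmpi_custom.lua (not the
-- real Compute Canada file). The Open MPI version ("4.1") arrives via `...`.
local ompi_version = ...
-- ompi_version could be used for version-specific tweaks; it is unused here.
local interconnect = os.getenv("RSNT_INTERCONNECT") or "ethernet"

-- Apply the exclusions summarised above; setenv() is available because this
-- chunk is executed from within an Lmod modulefile.
if interconnect == "ethernet" then
        setenv("OMPI_MCA_btl", "^openib,ofi")
        setenv("OMPI_MCA_mtl", "^ofi")
        setenv("OMPI_MCA_osc", "^ucx")
        setenv("OMPI_MCA_pml", "^ucx")
elseif interconnect == "infiniband" then
        setenv("OMPI_MCA_btl", "^openib,ofi")
        setenv("OMPI_MCA_mtl", "^ofi")
elseif interconnect == "omnipath" then
        setenv("OMPI_MCA_btl", "^openib")
        setenv("OMPI_MCA_osc", "^ucx")
        setenv("OMPI_MCA_pml", "^ucx")
end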

Labels: 2023.06-software.eessi.io, bug
