Conversation

@bgoglin (Contributor) commented Nov 27, 2019

This should be backported to 4.0.x and 3.1.x (I am not sure how you guys handle this)

Both opal_hwloc_base_get_relative_locality() and _get_locality_string()
iterate over hwloc levels to build the proc locality information.
Unfortunately, NUMA nodes are not in those normal levels anymore since hwloc 2.0.
We have to explicitly look at the special NUMA level to get that locality info.

I factored the core of the iterations into dedicated "_by_depth"
functions and call them again for the NUMA level at the end of the loops.

Thanks to Hatem Elshazly for reporting the NUMA communicator split failure
at https://www.mail-archive.com/[email protected]/msg33589.html

It looks like only the opal_hwloc_base_get_locality_string() part is needed
to fix that split, but there's no reason not to fix get_relative_locality()
as well.

Signed-off-by: Brice Goglin [email protected]
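
For readers unfamiliar with the hwloc 2.x change, here is a minimal standalone sketch (not the actual OMPI patch, and built with -lhwloc) showing why a plain loop over topology depths misses NUMA nodes on hwloc >= 2.0 and how the special NUMA depth has to be queried explicitly:

/* Hypothetical sketch, not the OMPI code: on hwloc >= 2.0, NUMA nodes
 * are attached at a special virtual depth rather than a normal level. */
#include <hwloc.h>
#include <stdio.h>

static void count_level(hwloc_topology_t topo, int depth)
{
    unsigned n = hwloc_get_nbobjs_by_depth(topo, depth);
    hwloc_obj_t obj = hwloc_get_obj_by_depth(topo, depth, 0);
    printf("depth %d (%s): %u objects\n", depth,
           obj ? hwloc_obj_type_string(obj->type) : "none", n);
}

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(&topo);

    /* Normal levels: Machine, Package, caches, Core, PU, ... */
    int depth, topodepth = hwloc_topology_get_depth(topo);
    for (depth = 0; depth < topodepth; depth++)
        count_level(topo, depth);

    /* NUMA nodes are NOT visited by the loop above on hwloc 2.x;
     * their special (negative) depth must be queried explicitly. */
    count_level(topo, HWLOC_TYPE_DEPTH_NUMANODE);

    hwloc_topology_destroy(topo);
    return 0;
}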

@hppritcha (Member)

@bgoglin this should be backported to 4.0.x?

@hppritcha (Member)

Sorry, I didn’t read your commit message.

@hppritcha (Member)

@bgoglin when I try to test this PR on an x86_64 dual core haswell system, it doesn't seem to be working:

hpp@cn165:~>mpirun -np 48 ./test_numa_bug
---- WORLD Communicator size: 48
**** SHARED Communicator size: 48
#### NUMA Communicator size: 1
(the "#### NUMA Communicator size: 1" line is printed 48 times, once per rank)

@bgoglin (Contributor, Author) commented Jan 27, 2020

@hppritcha Strange, I just tested again. Git master (last nightly snapshot) fails as expected; the patch from this PR works fine. Using the internal hwloc, on a dual-socket 12-core haswell with Cluster-on-Die (2 sockets, 2 NUMA nodes each, 6 cores each).

@gpaulsen (Member)

@bgoglin what is your mpirun command line, specifically hostfile details? I'm trying on ppc64le as well.

@bgoglin (Contributor, Author) commented Jan 29, 2020

There's nothing interesting in my command line; I am running on a single node with: mpiexec --mca mtl ^psm,psm2 -np 24 ~/split

By the way, I added more tests to the test program to confirm that only the NUMA level was broken:

#include <stdio.h>
#include "mpi.h"

int main(){
    MPI_Init(NULL, NULL);

    int world_size, shared_size, numa_size, sock_size, l3_size;
    int world_rank, shared_rank, numa_rank, sock_rank, l3_rank;
    int key = 0;
    MPI_Comm shared_comm, numa_comm, sock_comm, l3_comm;

    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Standard split: all ranks sharing the node. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, key, MPI_INFO_NULL, &shared_comm);
    MPI_Comm_size(shared_comm, &shared_size);
    MPI_Comm_rank(shared_comm, &shared_rank);

    /* Open MPI extension: ranks sharing a socket/package. */
    MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_SOCKET, key, MPI_INFO_NULL, &sock_comm);
    MPI_Comm_size(sock_comm, &sock_size);
    MPI_Comm_rank(sock_comm, &sock_rank);

    /* Open MPI extension: ranks sharing a NUMA node (the broken case). */
    MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_NUMA, key, MPI_INFO_NULL, &numa_comm);
    MPI_Comm_size(numa_comm, &numa_size);
    MPI_Comm_rank(numa_comm, &numa_rank);

    /* Open MPI extension: ranks sharing an L3 cache. */
    MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_L3CACHE, key, MPI_INFO_NULL, &l3_comm);
    MPI_Comm_size(l3_comm, &l3_size);
    MPI_Comm_rank(l3_comm, &l3_rank);

    /* Rank 0 of each communicator reports that communicator's size. */
    if(world_rank == 0){
        printf("---- WORLD Communicator size: %d\n", world_size);
    }
    if(shared_rank == 0){
        printf("**** SHARED Communicator size: %d\n", shared_size);
    }
    if(sock_rank == 0){
        printf("#### SOCK Communicator size: %d\n", sock_size);
    }
    if(numa_rank == 0){
        printf("#### NUMA Communicator size: %d\n", numa_size);
    }
    if(l3_rank == 0){
        printf("#### L3 Communicator size: %d\n", l3_size);
    }

    MPI_Finalize();
    return 0;
}
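
For completeness, the test above can be built with Open MPI's wrapper compiler and launched the same way as before (the source file name split.c is just an assumption; note that the OMPI_COMM_TYPE_* split types are Open MPI extensions, so the program will not compile against other MPI implementations):

mpicc -o split split.c
mpiexec --mca mtl ^psm,psm2 -np 24 ./split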

@bgoglin (Contributor, Author) commented Jan 29, 2020

I also tried with 4 of those haswell nodes in a single job with 96 ranks; it works fine.
Then with 144 ranks on 4 skylake nodes (2 sockets x 1 NUMA node x 18 cores each), no problem either.

@hppritcha (Member)

Okay, not sure what was going on, but it now works for me, at least on ARM TX2 nodes.

@hppritcha hppritcha merged commit d2b68e6 into open-mpi:master Feb 3, 2020
hppritcha added a commit to hppritcha/ompi that referenced this pull request Feb 3, 2020
not being defined.

related to open-mpi#7201

Signed-off-by: Howard Pritchard <[email protected]>
hppritcha added a commit to hppritcha/ompi that referenced this pull request Feb 4, 2020
PR open-mpi#7201 broke use of the hwloc 1.x series. This patch gets hwloc 1.x working again with OMPI.

fixes open-mpi#7362

Signed-off-by: Howard Pritchard <[email protected]>
cniethammer pushed a commit to cniethammer/ompi that referenced this pull request May 10, 2020
not being defined.

related to open-mpi#7201

Signed-off-by: Howard Pritchard <[email protected]>