Skip to content

Conversation

@hppritcha
Copy link
Member

No description provided.

bgoglin and others added 2 commits February 3, 2020 13:29
Both opal_hwloc_base_get_relative_locality() and _get_locality_string()
iterate over hwloc levels to build the proc locality information.
Unfortunately, NUMA nodes are not in those normal levels anymore since 2.0.
We have to explicitly look a the special NUMA level to get that locality info.

I am factorizing the core of the iterations inside dedicated "_by_depth"
functions and calling them again for the NUMA level at the end of the loops.

Thanks to Hatem Elshazly for reporting the NUMA communicator split failure
at https://www.mail-archive.com/[email protected]/msg33589.html

It looks like only the opal_hwloc_base_get_locality_string() part is needed
to fix that split, but there's no reason not to fix get_relative_locality()
as well.

Signed-off-by: Brice Goglin <[email protected]>
(cherry picked from commit ea80a20)
not being defined.

related to open-mpi#7201

Signed-off-by: Howard Pritchard <[email protected]>
@hppritcha hppritcha requested a review from gpaulsen February 3, 2020 21:26
@hppritcha hppritcha added the NEWS label Feb 3, 2020
@hppritcha
Copy link
Member Author

@gpaulsen for some reason the opal_asprintf not being defined in @bgoglin 's original commit was causing build to fail for me so had to patch in an include to opal/util/print.h.

Verified test case passes.

@gpaulsen
Copy link
Member

gpaulsen commented Feb 3, 2020

I'll give it a build / test tomorrow and then merge this after a review.

@hppritcha
Copy link
Member Author

@gpaulsen don't merge yet. this will break backward compatibility with hwloc 1 series.

@gpaulsen
Copy link
Member

gpaulsen commented Feb 5, 2020

I assume the issues you're referring to are resolved by #7366 ?

@bgoglin
Copy link
Contributor

bgoglin commented Feb 5, 2020

@gpaulsen Yes, they should.

@gpaulsen
Copy link
Member

Okay, I pushed the commit from #7366 to this branch as well.

@gpaulsen gpaulsen added this to the v4.0.3 milestone Feb 10, 2020
@gpaulsen gpaulsen linked an issue Feb 10, 2020 that may be closed by this pull request
@bgoglin
Copy link
Contributor

bgoglin commented Feb 10, 2020

@gpaulsen It's actually the other commit from #7366 that is important: 907ad85
The one you took is only for fixing warnings.

@gpaulsen
Copy link
Member

oops. Thanks for double checking. I'll cherry-pick again in the next hour.

Build was broken by mistake in commit d40662edc41a5a4d09ae690b640cfdeeb24e15a1

Fixes open-mpi#7362

Signed-off-by: Brice Goglin <[email protected]>
(cherry picked from commit 907ad85)
@gpaulsen
Copy link
Member

Ok, please re-review.

Note that I also changed f136804#diff-f4ec0cf4c6659e1c4af03b7a807378d8R2262 to asprintf. I think that was correct, but wanted to point it out.

@hppritcha
Copy link
Member Author

@gpaulsen looks good and ready to merge

@gpaulsen gpaulsen merged commit a1259e6 into open-mpi:v4.0.x Feb 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Communicator Split Type NUMA Behavior

3 participants