Skip to content

Conversation

@naughtont3
Copy link
Contributor

The existing code in compare_cpusets assumed that some non_io ancestor of a PCI object should intersect with the cpuset of the proc. However, this is not true. There is a case where the non IO ancestor can be an L3. If there exists two L3s on the same NUMA and the process is bound to one L3, but the PCI object is connected to the other L3, then compare_cpusets() will return false.

A better way to determine the optimal interface is by finding the distances of the interfaces from the current process. Then find out which of these interfaces is nearest the process and select it.

Use the PMIx distance generation for this purpose.

Signed-off-by: Amir Shehata [email protected]
(cherry picked from commit d919232)

The existing code in compare_cpusets assumed that some non_io ancestor of a
PCI object should intersect with the cpuset of the proc. However, this is
not true. There is a case where the non IO ancestor can be an L3. If there
exists two L3s on the same NUMA and the process is bound to one L3, but
the PCI object is connected to the other L3, then compare_cpusets() will
return false.

A better way to determine the optimal interface is by finding the
distances of the interfaces from the current process. Then find out which
of these interfaces is nearest the process and select it.

Use the PMIx distance generation for this purpose.

Signed-off-by: Amir Shehata <[email protected]>
(cherry picked from commit d919232)
@naughtont3
Copy link
Contributor Author

@wckzhang Would you mind reviewing this for v5.0.x branch? This is cherry-pick of #11338.

@naughtont3
Copy link
Contributor Author

@amirshehataornl Added for v5.0.x

@naughtont3 naughtont3 requested a review from wzamazon April 5, 2023 03:01
@lrbison
Copy link
Contributor

lrbison commented Apr 5, 2023

Please do not merge. I find that this commit broke main's compile for gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-15).

@janjust
Copy link
Contributor

janjust commented Apr 6, 2023

@lrbison @naughtont3 @wzamazon What do you guys want to do here, will you update this PR with the corresponding main? Is this PR urgent?

@lrbison
Copy link
Contributor

lrbison commented Apr 6, 2023

I'll leave someone else to comment about the priority and urgency, but I'm scratching my head wondering how this commit passed CI. It looks like it uses a macro defined in 3rd-party/openpmix/include/pmix_deprecated.h which is clearly broken, yet CI said these changes were OK. Is CI not using the same submodules as the main branch?

macro in pmix_deprecated.h:

// free(m) is called inside PMIx_Topology_free().
#define PMIX_TOPOLOGY_FREE(m, n)  \
    do {                          \
        PMIx_Topology_free(m, n); \
        (m) = NULL;               \
    while (0)

Note the missing closing }.

@amirshehataornl
Copy link
Contributor

@lrbison the macro issue has been resolved here: openpmix/openpmix#3035
@janjust I have pushed a new PR which moves away from using the deprecated macros: #11565

@rhc54
Copy link
Contributor

rhc54 commented Apr 6, 2023

Is CI not using the same submodules as the main branch?

No - it uses the submodules as they currently are defined in the v5.0 branch, which lags behind the main branch. Likewise, the PR in the main branch used the submodule definitions that were current at the time of filing, not as they were later after update.

@lrbison
Copy link
Contributor

lrbison commented Apr 6, 2023

oops, sorry, I meant to comment on the main-branch PR, not this one for the v5 branch. Makes sense that this one would use v5.0 of course.

Likewise, the PR in the main branch used the submodule definitions that were current at the time of filing, not as they were later after update

Right. But main was in a state where it wouldn't compile, and I'm wondering how that happened. So what must have happened is that the PR to main was OK, but some bump to pmix pointers should have failed CI but wasn't caught?

I'm not trying to point fingers at anyone here, I'm just trying to make sure we don't have a hole in CI testing somewhere.

@amirshehataornl
Copy link
Contributor

I think what happened is the patch was pushed against a PMIx which didn't have this problem and sat there for a while and when it was merged, it didn't get rebased on the latest and greatest first, therefore this issue was missed. And since there was no conflict in the OMPI code when it was merged to main, the issue was never noticed until the patch landed.

@lrbison
Copy link
Contributor

lrbison commented Apr 6, 2023

That makes sense Amir. Thank you. Just unfortunate I guess!

@amirshehataornl
Copy link
Contributor

amirshehataornl commented Apr 6, 2023

one idea to avoid this in the future is for the CI to compare the 3rdparty submodules which the patch has been compiled against with the main submodules and if the versions have changed block merging until the patch is rebased. I think if that's part of the process we will catch similar issues to this one. Just not sure how feasible this is with the CI. I'm not very familiar with it.

@awlauria
Copy link
Contributor

This was reverted out of main here: #11562

@awlauria awlauria closed this Apr 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants