-
Notifications
You must be signed in to change notification settings - Fork 931
mtl/ofi: NIC selection update #11338
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Can one of the admins verify this patch? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you lost a commit in the process.
opal/mca/common/ofi/common_ofi.c
Outdated
| distances = (pmix_device_distance_t*)dptr->array; | ||
|
|
||
| for (i = 0; i < dptr->size; i++) | ||
| fprintf(stderr, "%d: %d:%s:%d:%d\n", getpid(), i, distances[i].uuid, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you meant to take this debug out?
opal/mca/common/ofi/common_ofi.c
Outdated
|
|
||
| for(osdev = pcidev->io_first_child; osdev != NULL; osdev = osdev->next_sibling) { | ||
| int i; | ||
| const char *address = hwloc_obj_get_info_by_name(osdev, "Address"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you lost something here - you need to address the two cases: (a) where the device is declared to be "network" (which has the "address" field), and (b) where the device is declared to be "openfabric" (which has "nodeguid" and "systemguid", IIRC).
|
FWIW: if you go back to the prior PR (#11206), you can simply recover the diff by adding ".diff" to the PR URL - it sends you to https://patch-diff.githubusercontent.com/raw/open-mpi/ompi/pull/11206.diff. You can then save that diff and apply it to your new repo to fully recover where you were at. |
2ea2f93 to
d22e881
Compare
|
sorry about that. I was working on two different computers and didn't sync with my latest changes. This should bring back the exact same code. |
|
Looks good to me! |
|
@rhc54, thinking about it some more, is there a situation where Open MPI would not have access to PMIx? IE, it's not compiled with it? Do we need to handle this scenario? |
|
Beginning with OMPI v5, PMIx support is required. It is possible that the server the app process connects to won't provide distance info - but PMIx support itself must be there. |
|
Should we have a path in the code then, where we check if distance info is available and if not, we calculate it ourselves? |
|
Safer - or have a fallback that may not be as optimal. Best is both since you may not be able to get the topology or it may not include things you recognize. |
|
Let me provide a little more direction:
Hope that helps. |
|
@rhc54, thanks. I was thinking along the same lines. I'll update the patch sometime today. |
d22e881 to
edbc40c
Compare
|
@rhc54, when you have a chance, if you could give it another look to see if it's good to go, that would be great. Thanks. |
|
Looks okay to me - I think this will do what you wanted 😄 |
edbc40c to
cf15573
Compare
|
@rhc54 cleaned up one place in the code. |
|
Need to get someone from the OMPI community to approve CI for it and to review it for commit. Afraid I can't help with either of those as I'm not part of that community. I think @hppritcha or @bwbarrett might fit the bill, or can assign someone they feel appropriate. |
|
okay to test |
|
@hppritcha, is there anything left for this patch to be done? |
|
@wckzhang could you review this PR? |
|
Okay let me look at it today |
|
@amirshehataornl FYI, coming back to the old discussion about Cray NICs in hwloc: I now have access to a machine with Slingshot NICs (nodes identical to Frontier). If you need something from hwloc, I'll be able to look into it. For now I am just going to mark these Ethernet NIC with the "Slingshot" subtype string. |
cf15573 to
72b0faa
Compare
Would you be able to review the patch again and see if the changes you're thinking about will break it? |
I don't see anything that would break (you're not looking at the subtype attribute that I will set). |
opal/mca/common/ofi/common_ofi.c
Outdated
| * hwloc_cpuset_intersects() | ||
| */ | ||
| static bool compare_cpusets(hwloc_topology_t topology, struct fi_pci_attr pci) | ||
| static int calculate_distances(pmix_device_distance_t **distances, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe more appropriately named to discover_devices_and_calc_distances or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed to compute_dev_distances() as we're not really discovering devices. HWLOC has already discovered the topology. We're just using it to calculate the distance to the devices.
| if (NULL == obj) { | ||
| goto error; | ||
| } | ||
| if (osdev->attr->osdev.type == HWLOC_OBJ_OSDEV_OPENFABRICS) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not too familiar with this, are these the only two types osdev.type can be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe network devices can only be of these two types.
opal/mca/common/ofi/common_ofi.c
Outdated
| char lsguid[256], lnguid[256]; | ||
| int ret; | ||
|
|
||
| ret = scanf(distances[i].uuid, "fab://%256s::%256s", lnguid, lsguid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be sscanf? I can't see why we would want to read from stdin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right
The existing code in compare_cpusets assumed that some non_io ancestor of a PCI object should intersect with the cpuset of the proc. However, this is not true. There is a case where the non IO ancestor can be an L3. If there exists two L3s on the same NUMA and the process is bound to one L3, but the PCI object is connected to the other L3, then compare_cpusets() will return false. A better way to determine the optimal interface is by finding the distances of the interfaces from the current process. Then find out which of these interfaces is nearest the process and select it. Use the PMIx distance generation for this purpose. Signed-off-by: Amir Shehata <[email protected]>
72b0faa to
d919232
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with this, but I'd like to know what testing was done since I'm surprised the scanf didn't get caught on compilation.
|
I don't have have access to a device which has the type: HWLOC_OBJ_OSDEV_OPENFABRICS. I tested on crusher and the CXI device registers as a NETWORK device. So my tests didn't hit the OPENFABRICS path of the new code. |
|
@hppritcha, @jsquyres Do we need anything else for this PR or is it good to land? |
|
@janjust It would be good to have this in v5.0.x branch too once it lands in main. What's proper process for taking to that branch, post another PR for v5.0.x? |
|
checkout that branch and do a |
|
There are instructions regarding submitting pull requests to branches here - https://github.com/open-mpi/ompi/wiki/SubmittingPullRequests |
|
looks like this may cause a problem building main. did a temporary revert for now to avoid breaking main while look into issue. |
|
Here is a log from a failing compile |
|
Root causes of the issue:
Other than these two main errors, there were few simple warnings that can be looked into but is not related to this PR. |
|
1 above: I'll push in a separate PMIx patch to resolve this. I moved away from using these deprecated Macros. using the functions directly Thanks for pointing out the warnings. Resolved. |
The original thread of discussion was closed when I deleted my local repo. here it is for reference: #11206
The existing code in compare_cpusets assumed that some non_io ancestor of a PCI object should intersect with the cpuset of the proc. However, this is not true. There is a case where the non IO ancestor can be an L3. If there exists two L3s on the same NUMA and the process is bound to one L3, but the PCI object is connected to the other L3, then compare_cpusets() will return false.
A better way to determine the optimal interface is by finding the distances of the interfaces from the current process. Then find out which of these interfaces is nearest the process and select it.
Use the PMIx distance generation for this purpose.
Signed-off-by: Amir Shehata [email protected]