@amirshehataornl
Contributor

The original thread of discussion was closed when I deleted my local repo. Here it is for reference: #11206

The existing code in compare_cpusets() assumed that some non-I/O ancestor of a PCI object should intersect with the cpuset of the process. However, this is not always true. The non-I/O ancestor can be an L3 cache: if there are two L3s on the same NUMA node and the process is bound to one L3 while the PCI object is connected to the other, compare_cpusets() returns false.

A better way to determine the optimal interface is to find the distance from the current process to each interface and select the nearest one.

Use the PMIx distance generation for this purpose.

Signed-off-by: Amir Shehata [email protected]
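
To make the selection rule in the description concrete, here is a minimal sketch of picking the nearest device once per-device distances are known. The `dev_distance_t` struct and `nearest_device()` are hypothetical names used only for illustration; they are not the code in this PR, and the real PMIx distance record carries more fields than shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-device distance record. */
typedef struct {
    const char *uuid;   /* device identifier */
    uint16_t mindist;   /* smallest distance from the process to the device */
} dev_distance_t;

/* Return the index of the device with the smallest mindist, or -1 if the
 * list is empty. Ties go to the first device encountered. */
static int nearest_device(const dev_distance_t *devs, size_t ndevs)
{
    assert(devs != NULL || ndevs == 0);
    int best = -1;
    for (size_t i = 0; i < ndevs; i++) {
        if (best < 0 || devs[i].mindist < devs[(size_t) best].mindist) {
            best = (int) i;
        }
    }
    return best;
}
```

Unlike a cpuset intersection test, this comparison still yields an answer when the process and the device sit under different L3s of the same NUMA node: both devices simply get a distance, and the smaller one wins.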

@ompiteam-bot

Can one of the admins verify this patch?

@amirshehataornl
Contributor Author

@rhc54, sorry, I deleted my original repo, which closed the original PR. I'm still wrapping my mind around the GitHub workflow. Here is the original PR for reference: #11206

I resynced with the latest code to pass the CI (hopefully).

Contributor

@rhc54 left a comment

Looks like you lost a commit in the process.

distances = (pmix_device_distance_t*)dptr->array;

for (i = 0; i < dptr->size; i++)
fprintf(stderr, "%d: %d:%s:%d:%d\n", getpid(), i, distances[i].uuid,
Contributor

I believe you meant to take this debug out?


for(osdev = pcidev->io_first_child; osdev != NULL; osdev = osdev->next_sibling) {
int i;
const char *address = hwloc_obj_get_info_by_name(osdev, "Address");
Contributor

I think you lost something here - you need to address the two cases: (a) where the device is declared to be "network" (which has the "address" field), and (b) where the device is declared to be "openfabric" (which has "nodeguid" and "systemguid", IIRC).
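
The two cases in this comment can be sketched as a dispatch on the device class. Everything here is hypothetical: the `dev_class_t` enum, the `dev_attrs_t` struct, and `format_dev_id()` are illustrative names, and the `fab://<node>::<sys>` layout is only inferred from the format string that appears later in this review, not from the PMIx source.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical device classes mirroring cases (a) and (b). */
typedef enum { DEV_NETWORK, DEV_OPENFABRICS } dev_class_t;

/* Hypothetical attribute bundle; fields that do not apply may be NULL. */
typedef struct {
    dev_class_t cls;
    const char *address;   /* used by network devices */
    const char *node_guid; /* used by OpenFabrics devices */
    const char *sys_guid;  /* used by OpenFabrics devices */
} dev_attrs_t;

static int is_blank(const char *s)
{
    return s == NULL || strlen(s) == 0;
}

/* Write an identifier for the device into buf. Returns 0 on success, or -1
 * if a required attribute is missing or buf is too small. */
static int format_dev_id(const dev_attrs_t *d, char *buf, size_t len)
{
    assert(d != NULL && buf != NULL);
    int n;
    switch (d->cls) {
    case DEV_NETWORK:
        if (is_blank(d->address)) return -1;
        n = snprintf(buf, len, "%s", d->address);
        break;
    case DEV_OPENFABRICS:
        if (is_blank(d->node_guid) || is_blank(d->sys_guid)) return -1;
        n = snprintf(buf, len, "fab://%s::%s", d->node_guid, d->sys_guid);
        break;
    default:
        return -1;
    }
    return (n >= 0 && (size_t) n < len) ? 0 : -1;
}
```

Handling both classes explicitly means a device that lacks an address attribute is not silently skipped; it either produces an identifier from its GUIDs or reports an error.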

@rhc54
Contributor

rhc54 commented Jan 24, 2023

FWIW: if you go back to the prior PR (#11206), you can simply recover the diff by adding ".diff" to the PR URL - it sends you to https://patch-diff.githubusercontent.com/raw/open-mpi/ompi/pull/11206.diff. You can then save that diff and apply it to your new repo to fully recover where you were at.

@amirshehataornl
Contributor Author

Sorry about that. I was working on two different computers and hadn't synced my latest changes. This should bring back the exact same code.

@rhc54
Contributor

rhc54 commented Jan 24, 2023

Looks good to me!

@amirshehataornl
Contributor Author

@rhc54, thinking about it some more, is there a situation where Open MPI would not have access to PMIx, i.e., it's not compiled with it? Do we need to handle this scenario?

@rhc54
Contributor

rhc54 commented Jan 24, 2023

Beginning with OMPI v5, PMIx support is required. It is possible that the server the app process connects to won't provide distance info - but PMIx support itself must be there.

@amirshehataornl
Contributor Author

Should we have a path in the code then, where we check if distance info is available and if not, we calculate it ourselves?

@rhc54
Contributor

rhc54 commented Jan 24, 2023

Safer - or have a fallback that may not be as optimal. Best is both since you may not be able to get the topology or it may not include things you recognize.

@rhc54
Contributor

rhc54 commented Jan 25, 2023

Let me provide a little more direction:

  1. Attempt to get the device distance info directly using PMIx_Get.

  2. If that doesn't succeed, then try to get your cpuset (PMIx_Get_cpuset) and compute the device distances (PMIx_Compute_distances) for yourself.

  3. If that fails (e.g., cannot get a cpuset - you might be unbound - or the distances cannot be computed for some reason), then you need a fallback method for selecting the device to use. In this case, it probably won't matter much which one you use - e.g., if you aren't bound, then they are all equal - so maybe just use a round-robin assignment (e.g., rank % num_devices).
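
The three steps above can be sketched as a chain of fallbacks. This is only an outline under stated assumptions: `distance_source_fn`, `select_device()`, and the two stub sources are hypothetical placeholders standing in for the PMIx_Get lookup and the local PMIx_Compute_distances path, neither of which is called here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: fills dist[0..ndevs-1] and returns 0 on success,
 * non-zero if the information is unavailable. */
typedef int (*distance_source_fn)(uint16_t *dist, size_t ndevs);

/* Stub source that never has data, e.g. the process is unbound. */
static int source_unavailable(uint16_t *dist, size_t ndevs)
{
    (void) dist;
    (void) ndevs;
    return -1;
}

/* Stub source that reports device 1 as the nearest. */
static int source_fixed(uint16_t *dist, size_t ndevs)
{
    for (size_t i = 0; i < ndevs; i++) {
        dist[i] = (i == 1) ? 5 : 50;
    }
    return 0;
}

/* Steps 1 and 2: try each distance source in order and use the first one
 * that succeeds. Step 3: fall back to round-robin by rank. */
static size_t select_device(uint32_t rank, size_t ndevs,
                            distance_source_fn query_server,
                            distance_source_fn compute_locally,
                            uint16_t *scratch)
{
    assert(ndevs > 0 && scratch != NULL);
    distance_source_fn sources[] = { query_server, compute_locally };
    for (size_t s = 0; s < 2; s++) {
        if (sources[s] != NULL && sources[s](scratch, ndevs) == 0) {
            size_t best = 0;
            for (size_t i = 1; i < ndevs; i++) {
                if (scratch[i] < scratch[best]) {
                    best = i;
                }
            }
            return best;
        }
    }
    return rank % ndevs;
}
```

With four devices, rank 6 would land on device 2 when neither source answers, and on device 1 as soon as either stub reports distances.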

Hope that helps.

@amirshehataornl
Contributor Author

@rhc54, thanks. I was thinking along the same lines. I'll update the patch sometime today.

@amirshehataornl
Contributor Author

@rhc54, when you have a chance, if you could give it another look to see if it's good to go, that would be great. Thanks.

@rhc54
Contributor

rhc54 commented Jan 26, 2023

Looks okay to me - I think this will do what you wanted 😄

@amirshehataornl
Contributor Author

@rhc54, I cleaned up one place in the code.
What are the next steps for it to land?

@rhc54
Contributor

rhc54 commented Jan 27, 2023

Need to get someone from the OMPI community to approve CI for it and to review it for commit. Afraid I can't help with either of those as I'm not part of that community. I think @hppritcha or @bwbarrett might fit the bill, or can assign someone they feel appropriate.

@hppritcha
Member

okay to test

@amirshehataornl
Contributor Author

@hppritcha, is there anything left to be done for this patch?

@hppritcha requested a review from wckzhang January 30, 2023 19:22
@hppritcha
Member

@wckzhang could you review this PR?

@wckzhang
Contributor

Okay, let me look at it today.

@bgoglin
Contributor

bgoglin commented Mar 9, 2023

@amirshehataornl FYI, coming back to the old discussion about Cray NICs in hwloc: I now have access to a machine with Slingshot NICs (nodes identical to Frontier). If you need something from hwloc, I'll be able to look into it. For now I am just going to mark these Ethernet NICs with the "Slingshot" subtype string.

@amirshehataornl
Contributor Author

> @amirshehataornl FYI, coming back to the old discussion about Cray NICs in hwloc: I now have access to a machine with Slingshot NICs (nodes identical to Frontier). If you need something from hwloc, I'll be able to look into it. For now I am just going to mark these Ethernet NICs with the "Slingshot" subtype string.

Would you be able to review the patch again and see if the changes you're thinking about will break it?

@bgoglin
Contributor

bgoglin commented Mar 12, 2023

> > @amirshehataornl FYI, coming back to the old discussion about Cray NICs in hwloc: I now have access to a machine with Slingshot NICs (nodes identical to Frontier). If you need something from hwloc, I'll be able to look into it. For now I am just going to mark these Ethernet NICs with the "Slingshot" subtype string.
>
> Would you be able to review the patch again and see if the changes you're thinking about will break it?

I don't see anything that would break (you're not looking at the subtype attribute that I will set).

* hwloc_cpuset_intersects()
*/
static bool compare_cpusets(hwloc_topology_t topology, struct fi_pci_attr pci)
static int calculate_distances(pmix_device_distance_t **distances,
Contributor

Maybe this would be more appropriately named discover_devices_and_calc_distances or something similar.

Contributor Author

Renamed to compute_dev_distances(), as we're not really discovering devices. hwloc has already discovered the topology; we're just using it to calculate the distance to the devices.

if (NULL == obj) {
goto error;
}
if (osdev->attr->osdev.type == HWLOC_OBJ_OSDEV_OPENFABRICS) {
Contributor

I'm not too familiar with this; are these the only two types osdev.type can be?

Contributor Author

I believe network devices can only be of these two types.

char lsguid[256], lnguid[256];
int ret;

ret = scanf(distances[i].uuid, "fab://%256s::%256s", lnguid, lsguid);
Contributor

Is this supposed to be sscanf? I can't see why we would want to read from stdin

Contributor Author

right
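
As a sketch of parsing the identifier from memory rather than from stdin: scanf takes a `const char *` format as its first argument, so passing the UUID string there is still a well-typed call, which is one plausible reason the compiler did not reject it. Two further points are worth noting. `%256s` can store 256 characters plus a terminating NUL, one byte more than a 256-byte buffer holds, and `%s` stops only at whitespace, so it would swallow the `::` separator. The hypothetical `parse_fab_uuid()` below therefore splits on `::` by hand. The `fab://<node>::<sys>` layout is assumed from the format string in the snippet above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split "fab://<node>::<sys>" into its two parts. Returns 0 on success, or
 * -1 on malformed input or if a part does not fit in its buffer. */
static int parse_fab_uuid(const char *uuid, char *node, size_t node_len,
                          char *sys, size_t sys_len)
{
    static const char prefix[] = "fab://";
    assert(uuid != NULL && node != NULL && sys != NULL);

    if (strncmp(uuid, prefix, sizeof prefix - 1) != 0) {
        return -1;
    }
    const char *start = uuid + sizeof prefix - 1;
    const char *sep = strstr(start, "::");   /* first "::" is the separator */
    if (sep == NULL) {
        return -1;
    }
    size_t nlen = (size_t) (sep - start);
    size_t slen = strlen(sep + 2);
    if (nlen == 0 || slen == 0 || nlen >= node_len || slen >= sys_len) {
        return -1;
    }
    memcpy(node, start, nlen);
    node[nlen] = '\0';
    memcpy(sys, sep + 2, slen + 1);          /* includes the terminating NUL */
    return 0;
}
```

Splitting on the first `::` keeps GUIDs that contain single colons intact, and the explicit length checks replace the field widths that the scanf-style format was relying on.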

Contributor

@wckzhang left a comment

I'm okay with this, but I'd like to know what testing was done since I'm surprised the scanf didn't get caught on compilation.

@amirshehataornl
Contributor Author

I don't have access to a device of type HWLOC_OBJ_OSDEV_OPENFABRICS. I tested on Crusher, where the CXI device registers as a NETWORK device, so my tests didn't hit the OPENFABRICS path of the new code.

@amirshehataornl
Contributor Author

@hppritcha, @jsquyres Do we need anything else for this PR or is it good to land?

@naughtont3
Contributor

@janjust It would be good to have this in the v5.0.x branch too once it lands in main. What's the proper process for bringing it to that branch: post another PR for v5.0.x?

@wckzhang
Contributor

Check out that branch, run git cherry-pick -x <commit hash>, and create a PR against the v5.0.x branch.

@wckzhang
Contributor

There are instructions regarding submitting pull requests to branches here - https://github.com/open-mpi/ompi/wiki/SubmittingPullRequests

@naughtont3 merged commit ac2cfc1 into open-mpi:main Apr 4, 2023
@naughtont3
Contributor

Looks like this may cause a problem building main. I did a temporary revert for now to avoid breaking main while we look into the issue.

@lrbison
Contributor

lrbison commented Apr 5, 2023

Here is a log from a failing compile

@vidsouza

vidsouza commented Apr 5, 2023

Root causes of the issue:

  1. In line 499, a macro is used whose definition expands to a do-while loop. The do statement is not closed with a closing brace }.
  2. In line 616, the first_child member of hwloc_obj_t is accessed incorrectly as io_first_child.

Other than these two main errors, there were a few simple warnings that can be looked into but are not related to this PR.

common_ofi.c: In function 'get_nearest_nics':
common_ofi.c:559:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for (i = 0; i < ndist; i++) {
                   ^
common_ofi.c:521:22: warning: unused variable 'topo' [-Wunused-variable]
     pmix_topology_t *topo;
                      ^~~~
common_ofi.c: In function 'opal_common_ofi_select_provider':
common_ofi.c:775:34: warning: unused variable 'cpusets_match' [-Wunused-variable]
     bool provider_found = false, cpusets_match = false;
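
For reference, here is a small sketch of the two C patterns involved in the report above: a multi-statement macro whose do-while wrapper is fully closed, and a loop index whose type matches an unsigned count so that -Wsign-compare stays quiet. `SWAP_INT` and `count_below()` are hypothetical examples and are unrelated to the actual PMIx macros or to common_ofi.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical multi-statement macro. The do { ... } while (0) wrapper must
 * be opened and closed inside the macro body so that "SWAP_INT(a, b);"
 * behaves as a single statement, including inside if/else. */
#define SWAP_INT(a, b)      \
    do {                    \
        int tmp_ = (a);     \
        (a) = (b);          \
        (b) = tmp_;         \
    } while (0)

/* Count the values below limit. Using size_t for the index matches the
 * unsigned element count and avoids a signed/unsigned comparison. */
static size_t count_below(const int *vals, size_t n, int limit)
{
    assert(vals != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (vals[i] < limit) {
            count++;
        }
    }
    return count;
}
```

If the closing `} while (0)` were missing from such a macro, every use site would fail to compile with an unbalanced-brace error that points at the caller rather than at the macro definition.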

@amirshehataornl
Contributor Author

amirshehataornl commented Apr 5, 2023

#11565

1 above: I'll push a separate PMIx patch to resolve this. I moved away from using these deprecated macros and now use the functions directly.
2 above: the intention is to use io_first_child. That field is in the structure (at least in main).

Thanks for pointing out the warnings. Resolved.
