mtl/ofi: NIC selection update #11206
Conversation
Can one of the admins verify this patch?
I would caution that NUMA is no longer a well-defined entity in today's multi-chip packages - you may find that this proposed change behaves in an unexpected manner on other architectures.

@rhc54, the other option is to use the PMIx distance table you pointed me to in the comment on the issue I opened. Do you think that would be more reliable? Also, if you can give me some initial pointers on where I can see how that is used, that would be helpful. Thanks.

@rhc54 Also, do you recommend a minimum hwloc version that we should use?
bot:ok to test
please make sure to test this PR with a singleton test

By definition, singletons in OMPI v5 and above are not bound to anything, as all the "self-bind at startup" code has been removed. It is therefore impossible to define a "distance" for them. True, you could bind the singleton at some point - but that is likely a mistake. First, you have no idea what the user would want in terms of binding. Second, any binding that is done will be picked up by PRRTE if the singleton calls "comm_spawn" and used as a constraining window for all subsequent process starts. Probably not what anyone would really want.
Let me answer your questions by starting with a little history. In my former life at Intel (back about 4-5 years ago), I set up a grant for Brice (of HWLOC fame) to work with us on topology discovery/description. We were developing rather complex multi-chip packages and seeing performance issues that suspiciously looked topology-related - i.e., the codes weren't properly understanding the MCP topology and were mapping/binding in sub-optimal ways. Long story short, the problem was that the codes were trying to operate on the basis of "NUMA" domains - yet the NUMA concept wasn't that clear in these systems. There was memory on each CPU die (with multiple of those in a package), additional memory inside the package (divided between the CPUs by the BIOS - and therefore alterable), memory on the PCIe bus, memory in the GPUs and fabric interfaces, etc. - with the type of each of those memories being potentially different (DRAM, NVRAM, etc.). All of that memory was now potentially in the address space of the CPUs in the package - so when you asked to "bind to NUMA", precisely which NUMA(s) were you referring to? Far from clear. This is why you won't find NUMA-to-NUMA distance matrices in the HWLOC topology any more, at least in most cases. Really not very useful. It's also why I keep nagging people over here to update their NIC selection logic and stop using NUMA - but that's another battle.

What we are doing in its place is providing every process with a complete distance map from its location to every device of interest on the local node. You can then do with it what you like. We also provide each process with the map for every other process in the job, so you can (for example) know that rank 17 is going to select NIC 3 on its node as that is the closest to it, and then decide to use your NIC 2 to connect to it. At some point in the near future, we plan to add switch info so you can see which NICs share a switch, and thus potentially optimize collectives.

PRRTE has been providing the distance info for at least 2 years now. It is currently "off" by default, governed by the PRRTE MCA parameter "pmix_generate_distances":

```c
(void) pmix_mca_base_var_register("prte", "pmix", NULL, "generate_distances",
                                  "Device types whose distances are to be provided (default=none, options=fabric,gpu,network)",
                                  PMIX_MCA_BASE_VAR_TYPE_BOOL,
                                  &generate_dist);
prte_pmix_server_globals.generate_dist = 0;
if (NULL != generate_dist) {
    tmp = PMIX_ARGV_SPLIT_COMPAT(generate_dist, ',');
    for (i = 0; NULL != tmp[i]; i++) {
        if (0 == strcasecmp(tmp[i], "fabric")) {
            prte_pmix_server_globals.generate_dist |= PMIX_DEVTYPE_OPENFABRICS;
        } else if (0 == strcasecmp(tmp[i], "gpu")) {
            prte_pmix_server_globals.generate_dist |= PMIX_DEVTYPE_GPU;
        } else if (0 == strcasecmp(tmp[i], "network")) {
            prte_pmix_server_globals.generate_dist |= PMIX_DEVTYPE_NETWORK;
        }
    }
}
```

We can add other device types fairly easily. Note that Brice and I spent a fair amount of time coming up with a reasonable "distance" metric for computing these values. The values are purely relative and not time measurements - a distance of two means the device is twice as far from you as a device at a distance of one. It should be somewhat correlated to time, but not rigorously so. You can read more about all this in Section 11.4 of the PMIx Standard, which covers the server side of computing the distances. On the client side, you retrieve them via PMIx_Get.

Any HWLOC released in the last 4 years is fine. You'll also need PMIx v4.0 or above, and PRRTE v3.0, to provide the data. HTH
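To make the client side concrete, here is a minimal hedged sketch of fetching and walking that distance array with PMIx_Get. The PMIX_DEVICE_DISTANCES key and the pmix_device_distance_t fields (uuid, mindist, maxdist) follow the PMIx v4 headers - verify against your installed version - and error handling is elided:

```c
#include <pmix.h>
#include <stdio.h>

/* sketch: assumes PMIx_Init() already populated "myproc" */
static void print_device_distances(const pmix_proc_t *myproc)
{
    pmix_value_t *val = NULL;

    /* the distances arrive as a data array of pmix_device_distance_t */
    if (PMIX_SUCCESS == PMIx_Get(myproc, PMIX_DEVICE_DISTANCES, NULL, 0, &val)
        && PMIX_DATA_ARRAY == val->type) {
        pmix_device_distance_t *d = (pmix_device_distance_t *) val->data.darray->array;
        for (size_t n = 0; n < val->data.darray->size; n++) {
            fprintf(stderr, "%s: min=%u max=%u\n",
                    d[n].uuid, (unsigned) d[n].mindist, (unsigned) d[n].maxdist);
        }
        PMIX_VALUE_RELEASE(val);
    }
}
```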
Thanks Ralph. This was very helpful.

If you experiment with this, please let me know how it works for you - the distance algorithm really hasn't been well tested yet for usefulness (i.e., how well you can optimize in various scenarios based on it). I'd be happy to point you to where it is done in PMIx so you can tinker with and improve upon it.
@rhc54, will do. I'll definitely be trying it on our system. Pointers into the code would be helpful. I've been looking at PMIx_Compute_distances(). Is that it?

Yep, that's the one!
@rhc54 happy new year. I'm back looking at this. I ended up initializing the cpuset manually, as pmix_cpuset_t is just a thin wrapper on top of an hwloc bitmap - see the test sketch below. That worked, and I did get the distances, but only from the process to the NICs. The results look correct based on our topology. However, should I be getting distances to other objects - the GPUs, for example - seeing that I had that in the type bit field? I'll delve into this in more detail, but any insight would be greatly appreciated.

The test code works regardless of whether I have the generate-distances parameter set or not. Which is good, but doesn't align with what you mentioned regarding it being turned off by default. Another issue I ran into is that setting that parameter explicitly triggers a segfault. I'll investigate and see if I can push up a fix.
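A minimal sketch of such manual initialization, assuming the two-field pmix_cpuset_t layout from the PMIx v4 headers (a source string plus an hwloc bitmap); the helper name is illustrative:

```c
#include <hwloc.h>
#include <pmix.h>
#include <string.h>

/* sketch: populate a pmix_cpuset_t from the calling thread's binding;
 * "topo" is an already-loaded hwloc topology */
static int fill_cpuset(hwloc_topology_t topo, pmix_cpuset_t *cpuset)
{
    memset(cpuset, 0, sizeof(*cpuset));
    cpuset->source = strdup("hwloc");   /* records where the bitmap came from */
    cpuset->bitmap = hwloc_bitmap_alloc();
    return hwloc_get_cpubind(topo, cpuset->bitmap, HWLOC_CPUBIND_THREAD);
}
```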
I'm rather confused as to what you are trying to do. PRRTE computes the distances and includes them in the data provided to a process - all you have to do is access them via PMIx_Get. If you are trying to compute the distances yourself at the application level, then yes, you do have to get your topology first.
I honestly don't remember offhand where I left the implementation - it could be that I didn't work through the OR'd bit fields.

We won't recognize that parameter value - it's a boolean param, so the word "on" means nothing to it.

That would be the more concerning issue, as it means PRRTE is segfaulting when attempting to compute the distances.
@rhc54, the problem is with PMIx_Get_cpuset(). It crashes when I call it. Should I be able to use that API? The reason for the crash is pmix_globals.topology.topology == NULL. I call PMIx_Get_cpuset() from OMPI in the opal_common_ofi_select_provider() path. I'm trying to understand how pmix_globals.topology.topology is supposed to be initialized.
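For reference, a hedged sketch of the call in question, following the PMIx v4 prototype (the bind-envelope argument selects process- vs. thread-scoped binding; the wrapper function is illustrative):

```c
#include <pmix.h>

/* sketch: ask the PMIx client library for our cpuset - this is the call
 * that crashes when pmix_globals.topology.topology was never initialized */
static pmix_status_t query_my_cpuset(pmix_cpuset_t *cpuset)
{
    PMIX_CPUSET_CONSTRUCT(cpuset);
    return PMIx_Get_cpuset(cpuset, PMIX_CPUBIND_PROCESS);
}
```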
Hmmm...interesting dilemma! Technically, yes, you should be able to use that API. At a more practical level, I've only ever used it on the server side of things, as that is where we compute distances and such. Looking at the code, I can see why it would crash if the client never loaded its copy of the topology.
@rhc54, thanks. And I think I'm starting to see the intended workflow. However, since the APIs are there and should be usable from the client as well, it seems like calling PMIx_Compute_distances() directly would avoid us having to rely on configuration, which may or may not be set. Thoughts?
This (openpmix/openpmix#2907) should fix the dilemma.

There are tradeoffs, of course. You gain portability by doing it yourself, as you don't know if PRRTE is around or not, and maybe the host environment isn't as kind as PRRTE. The only real negative is that you can't easily generate the distances for the remote side. Here is what I would recommend: try a PMIx_Get of the distances first, and fall back to computing them yourself if nothing was provided.

Meantime, if you are finding this useful, then we should turn on the generate-distances feature by default, and ensure that we are computing the distances for the devices of interest. Might encourage others to use them as well 😄
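A hedged sketch of that get-first, compute-as-fallback flow, assuming the PMIx v4 client APIs and eliding most error handling:

```c
#include <pmix.h>

/* sketch: prefer PRRTE-provided distances, compute locally otherwise.
 * "myproc" comes from PMIx_Init; "topo" and "cpuset" must already be
 * populated (e.g. via the manual initialization shown earlier). */
static pmix_status_t get_distances(const pmix_proc_t *myproc,
                                   pmix_topology_t *topo, pmix_cpuset_t *cpuset,
                                   pmix_device_distance_t **dist, size_t *ndist)
{
    pmix_info_t directive;
    pmix_value_t *val = NULL;
    pmix_status_t rc;

    /* "optional" => fail fast instead of waiting on the server */
    PMIX_INFO_LOAD(&directive, PMIX_OPTIONAL, NULL, PMIX_BOOL);
    rc = PMIx_Get(myproc, PMIX_DEVICE_DISTANCES, &directive, 1, &val);
    PMIX_INFO_DESTRUCT(&directive);
    if (PMIX_SUCCESS == rc) {
        *dist = (pmix_device_distance_t *) val->data.darray->array;
        *ndist = val->data.darray->size;
        return PMIX_SUCCESS;
    }

    /* nothing was provided - compute the distances ourselves */
    return PMIx_Compute_distances(topo, cpuset, NULL, 0, dist, ndist);
}
```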
@rhc54, I have another more general topology question.

By topology, do you mean the topology of the node? Or do you mean the binding of the process running on B? The answer to the latter is "yes" - we provide you with the binding/cpuset of all procs in the job. If you want the node's topology, that's more difficult to get. It is unlikely that the runtime will share full topologies around the job. However, PRRTE does do this, and so in that case it should be possible to retrieve it. I'd have to look to see the semantics for such a request.
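For the binding/cpuset of a peer, a minimal hedged sketch, assuming the PMIX_CPUSET key (a string form of the peer's binding) and reusing the rank-17 example from earlier in the thread:

```c
#include <pmix.h>
#include <string.h>

/* sketch: fetch the cpuset string reported for rank 17 in our namespace */
static char *get_peer_cpuset(const pmix_proc_t *myproc)
{
    pmix_proc_t peer;
    pmix_value_t *val = NULL;
    char *cpuset = NULL;

    PMIX_PROC_LOAD(&peer, myproc->nspace, 17);
    if (PMIX_SUCCESS == PMIx_Get(&peer, PMIX_CPUSET, NULL, 0, &val)
        && PMIX_STRING == val->type) {
        cpuset = strdup(val->data.string);
        PMIX_VALUE_RELEASE(val);
    }
    return cpuset;
}
```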
@rhc54 found the cause of the crash when specifying the generate_dist variable.
Hooray! Thanks!

BTW: when attempting this "get", be sure to pass an "optional" directive to it. This will instruct PMIx to return immediately if the value isn't already available locally, rather than blocking while it asks the server for it:

```c
pmix_info_t directive;
PMIX_INFO_LOAD(&directive, PMIX_OPTIONAL, NULL, PMIX_BOOL);
PMIx_Get(proc, key, &directive, 1, &value);
PMIX_INFO_DESTRUCT(&directive);
```
@rhc54, going back to the segfault in pmix_server_register_fns.c:prte_pmix_server_register_nspace(): the darray is populated without darray.type being set, which causes a segfault later on. I tried setting darray.type = PMIX_INFO, but I don't think that's right either. It looks like it ought to be set to PMIX_POINTER, but I would like confirmation.
It should be set to the actual type of the array's elements - not PMIX_POINTER, since the array holds the structs themselves rather than pointers to them.
Note that the "info_list_add" code is going to make a copy of that array, and so there may be some issue in that code path. The array should consist of elements matching the declared darray.type so that the copy logic knows how to duplicate them.
@rhc54 fixed the other crash I ran into. Also, I'm confirming that PMIx_Get(distance) works correctly: it succeeds if generate_distances is turned on, and otherwise it fails. I'd like to turn on distances by default - are you okay with that? Basically, the default value would be "fabric,gpu,network".

Absolutely - thanks!
@rhc54, turned on distance generation by default: openpmix/prrte#1639

Afraid I haven't seen their entry before - perhaps it would help if you provided the full entry for this device? Sounds like they didn't follow convention.

Can you post the part of the lstopo -v output that contains this OS device, its parent PCI device, and the other OS devices below it?
If NodeGUID is another of the infos, should it appear in the Network L#0 (Address=...) "hsn2" line?
Hmmm...I wonder if the issue here is that they are identified as an "Ethernet" device instead of an "OpenFabrics" device? An Ethernet device wouldn't have GUIDs associated with it - just an IPv4/6 address. I'll bet that is the problem. What are you seeing in the device_dist struct? I'm wondering if we just aren't constructing the struct info correctly, or perhaps are misidentifying the device type.
Oh wait, this is a Cray device? Cray has never answered any of my requests for information or access to hardware, hence I don't know where to get Cray-specific information such as GUIDs, etc. So there's no hwloc support at all; you only see basic Network (Ethernet) information.
The Slingshot 11 NICs are basically Ethernet cards with some bells and whistles.
Perhaps someone can provide Brice with access to a machine so he can try to improve the HWLOC support? I know I tried to get it for him back in my Intel days, but I left before we ever saw a device.
Can you post the output of a libfabric tool that shows device attributes, addresses, etc. for that hsn device? I'd like to get an idea of what hwloc could expose.
contacting Brice about this option offline

which option?
getting him an account on a box with SS11 NICs

Looks like there isn't much useful in the libfabric output. The address doesn't even look similar to what hwloc is seeing (maybe GDB doesn't know how to display it - is it a variable-length array?)
The reason I'm going through this exercise at the moment is to be able to convert the current process/NIC binding code in OMPI to use distances, and to make the code generic (not specific to the Frontier platform). Sound reasonable?

The address is written out explicitly, so I don't think that gdb is misinterpreting it.
I suspect the problem is that libfabric is obtaining the info from HWLOC, which has no real access (at this point) to the Cray information because they haven't shared it. I doubt we can resolve how to use device_dist until HWLOC can support this hardware, so @hppritcha is likely taking the right path.
@amirshehataornl One thing you could investigate is whether PMIx is correctly identifying this device as a "network" vs. an "openfabric" device. If so, then it should be providing the IP address as the UUID. Of course, that won't help if the address being reported by HWLOC differs from the one you are getting from the Cray OFI provider. Maybe you should look at that provider and see where it is getting the address? That could help explain the difference.
Overall, I think that should work - but only for the Cray and other Ethernet providers. I realize that covers the range of your interest, but is this going to cause a problem for OpenFabric providers?
opal/mca/common/ofi/common_ofi.c (outdated):

```c
distances = (pmix_device_distance_t*)dptr->array;

for (i = 0; i < dptr->size; i++)
    fprintf(stderr, "%d: %d:%s:%d:%d\n", getpid(), i, distances[i].uuid,
```
Shame on me - I missed the pmix_device_distances_t struct when providing "pretty-print" functions. I'll address that soon.
Can you expand on why this wouldn't work for other HW?
opal/mca/common/ofi/common_ofi.c (outdated):

```c
        + strlen(distances[i].uuid))
    continue;
if (!strcmp(addr+3, address)) {
    fprintf(stderr, "%d matched distance addr %s with %s\n",
```
If the Cray OFI provider's IP address matches that reported by HWLOC, then this should work (or so I expect).
sorry, I'm going to remove the fprintfs - these slipped in from my debugging.
The basic approach would work, just not the mechanics you have here. The problem lies in the precise attributes provided for each class of device. Ethernet devices have IP addresses associated with them - since the Cray SS11 is identified in that class, you can compare the PMIx UUID with the "address" attribute. These devices are classed as HWLOC_OBJ_OSDEV_NETWORK. However, Infiniband devices are classed as HWLOC_OBJ_OSDEV_OPENFABRICS and are identified by their NodeGUID/SysImageGUID rather than an IP address. You'll find the same to be true of GPUs and other device types - see the PMIx device distance code for more details:

```c
if (HWLOC_OBJ_OSDEV_NETWORK == table[n].hwtype) {
    char *addr = NULL;
    /* find the address */
    for (i = 0; i < device->infos_count; i++) {
        if (0 == strcasecmp(device->infos[i].name, "Address")) {
            addr = device->infos[i].value;
            break;
        }
    }
    if (NULL == addr) {
        /* couldn't find an address - report it as an error */
        PMIX_LIST_DESTRUCT(&dists);
        return PMIX_ERROR;
    }
    /* could be IPv4 or IPv6 */
    cnt = countcolons(addr);
    if (5 == cnt) {
        pmix_asprintf(&d->dist.uuid, "ipv4://%s", addr);
    } else if (19 == cnt) {
        pmix_asprintf(&d->dist.uuid, "ipv6://%s", addr);
    } else {
        /* unknown address type */
        PMIX_LIST_DESTRUCT(&dists);
        return PMIX_ERROR;
    }
} else if (HWLOC_OBJ_OSDEV_OPENFABRICS == table[n].hwtype) {
    char *ngid = NULL;
    char *sgid = NULL;
    /* find the UIDs */
    for (i = 0; i < device->infos_count; i++) {
        if (0 == strcasecmp(device->infos[i].name, "NodeGUID")) {
            ngid = device->infos[i].value;
        } else if (0 == strcasecmp(device->infos[i].name, "SysImageGUID")) {
            sgid = device->infos[i].value;
        }
    }
    if (NULL == ngid || NULL == sgid) {
        PMIX_LIST_DESTRUCT(&dists);
        return PMIX_ERROR;
    }
    pmix_asprintf(&d->dist.uuid, "fab://%s::%s", ngid, sgid);
} else if (HWLOC_OBJ_OSDEV_GPU == table[n].hwtype) {
    /* if the name starts with "card", then this is just the aux card of the GPU */
    if (0 == strncasecmp(device->name, "card", 4)) {
        pmix_list_remove_item(&dists, &d->super);
        PMIX_RELEASE(d);
        device = hwloc_get_next_osdev(topo->topology, device);
        continue;
    }
    pmix_asprintf(&d->dist.uuid, "gpu://%s::%s", pmix_globals.hostname,
                  device->name);
} else {
    /* unknown type */
    pmix_list_remove_item(&dists, &d->super);
    PMIX_RELEASE(d);
    device = hwloc_get_next_osdev(topo->topology, device);
    continue;
}
```
Makes sense. Keeping in mind that this function specifically targets network interfaces - i.e., find the nearest networking/openfabrics device to the process - it should be enough to check the type of the device and implement the logic of the HWLOC_OBJ_OSDEV_OPENFABRICS and HWLOC_OBJ_OSDEV_NETWORK branches. We don't need to handle GPUs in this function. My goal is to land this patch in OMPI, so we don't have to track it separately.
Yes - if you handle the two cases, you should be fine.
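A hedged sketch of what handling those two cases on the OMPI side might look like, keyed off the UUID schemes shown above; the helper name and the exact comparisons are illustrative assumptions, not the code in this PR:

```c
#include <pmix_common.h>
#include <stdbool.h>
#include <string.h>

/* illustrative helper: match a provider's address against a PMIx
 * device-distance UUID, handling the two relevant device classes */
static bool uuid_matches_addr(const pmix_device_distance_t *d, const char *addr)
{
    /* Ethernet-class devices: uuid is "ipv4://<addr>" or "ipv6://<addr>" */
    if (0 == strncmp(d->uuid, "ipv4://", 7) || 0 == strncmp(d->uuid, "ipv6://", 7)) {
        return 0 == strcmp(strstr(d->uuid, "://") + 3, addr);
    }
    /* OpenFabrics-class devices: uuid is "fab://<NodeGUID>::<SysImageGUID>" */
    if (0 == strncmp(d->uuid, "fab://", 6)) {
        return NULL != strstr(d->uuid + 6, addr);
    }
    return false;
}
```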
The existing code in compare_cpusets assumed that some non-IO ancestor of a PCI object should intersect with the cpuset of the proc. However, this is not true. There is a case where the non-IO ancestor can be an L3. If there exist two L3s on the same NUMA node and the process is bound to one L3, but the PCI object is connected to the other L3, then compare_cpusets() will return false.

A better way to determine the optimal interface is to find the distances of the interfaces from the current process, then find which of these interfaces is nearest the process and select it. Use the PMIx distance generation for this purpose.

Signed-off-by: Amir Shehata <[email protected]>
I don't see anything wrong, but I don't know exactly what PMIx and libfabric report here, hence it's impossible to be sure what's going to happen during your tests.
I'm good with it - I believe it will do what you seek. It will obviously need exposure to various environments to be sure.
FWIW: the error is due to Java failures to pull the repos. I believe there has been a change to the CI since this PR was originally created, so you may need to do a rebase against main.
The existing code in compare_cpusets assumed that some non-IO ancestor of a PCI object should intersect with the cpuset of the proc. However, this is not true. There is a case where the non-IO ancestor can be an L3. If there exist two L3s on the same NUMA node and the process is bound to one L3, but the PCI object is connected to the other L3, then compare_cpusets() will return false.
A better way of determining if the PCI object matches a process CPU set is to use the NUMA node as the common denominator.
Find all NUMA nodes on the system. Find which one intersects with the process' cpuset, and then determine whether this NUMA node intersects with the non-IO ancestor's node list. If both these conditions are true, then the PCI device matches the process.
Another change this patch brings is using HWLOC_CPUBIND_THREAD instead of HWLOC_CPUBIND_PROCESS when finding the cpuset of the process. There are cases where a process is initially bound to a set of CPUs, but on initialization the process can spawn more threads (some of which can be spawned by 3rd-party libraries). These threads can be explicitly affined differently from the initial binding.

One example of that is the HSA library. It spawns a thread upon initialization and explicitly binds it to all available CPUs. If we use HWLOC_CPUBIND_PROCESS, this will result in a less-than-optimal process-to-NIC binding.
It is safer to make the assumption that the thread which is currently running this code has the correct binding and base the algorithm off that.
Signed-off-by: Amir Shehata [email protected]
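To make that last point concrete, a minimal hwloc sketch of the thread-scoped binding query the commit message describes (the function name is illustrative):

```c
#include <hwloc.h>

/* sketch: query the binding of the calling thread only, rather than the
 * union of bindings across all threads in the process */
static hwloc_cpuset_t query_my_binding(hwloc_topology_t topo)
{
    hwloc_cpuset_t set = hwloc_bitmap_alloc();
    /* HWLOC_CPUBIND_THREAD restricts the query to the current thread,
     * avoiding contamination from helper threads (e.g. ones bound to all
     * CPUs by third-party libraries such as HSA) */
    if (0 != hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_THREAD)) {
        hwloc_bitmap_free(set);
        return NULL;
    }
    return set;
}
```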