mtl/ofi: NIC selection update #11206

amirshehataornl · 2022-12-13T17:55:31Z

The existing code in compare_cpusets assumed that some non_io ancestor of a PCI object should intersect with the cpuset of the proc. However, this is not true. There is a case where the non IO ancestor can be an L3. If there exists two L3s on the same NUMA and the process is bound to one L3, but the PCI object is connected to the other L3, then compare_cpusets() will return false.

A better way of determining if the PCI object matches a process CPU set is to use the NUMA node as the common denominator.

Find all NUMA nodes on the system. Find which one intersects with the process' cpu set and then determine if this NUMA node intersects with the non-IO ancestor node list. If both these conditions are true then the PCI devices matches the process.

Another change this patch brings is using HWLOC_CPUBIND_THREAD instead of HWLOC_CPUBIND_PROCESS when finding the cpuset of the process. There are cases, where a process is initially bound to a set of CPUs, but on initialization the process can spawn more threads (some of which can be spawned by 3rd party libraries). These threads can be explicitly affined differently from the initial binding.

One example of that is the HSA library. It spawns a thread upon initialization and explicitly binds it to all available CPUs. If we use HWLOC_CPUBIND_PROCESS, then this will result in less than performant process to NIC binding.

It is safer to make the assumption that the thread which is currently running this code has the correct binding and base the algorithm off that.

Signed-off-by: Amir Shehata [email protected]

ompiteam-bot · 2022-12-13T17:55:35Z

Can one of the admins verify this patch?

rhc54 · 2022-12-13T18:00:23Z

I would caution that NUMA is no longer a well-defined entity in today's multi-chip packages - you may find that this proposed change behaves in an unexpected manner on other architectures.

amirshehataornl · 2022-12-13T18:13:01Z

I would caution that NUMA is no longer a well-defined entity in today's multi-chip packages - you may find that this proposed change behaves in an unexpected manner on other architectures.

@rhc54, The other option is to use the PMIx distance table you pointed me to in the comment on the issue I opened. You think that would be more reliable? Also if you can give me some initial pointers of where I can see how that is used, that'll be helpful. Thanks

naughtont3 · 2022-12-13T19:05:02Z

@rhc54 Also, do you recommend a minimum hwloc version that we should use?

hppritcha · 2022-12-13T20:40:44Z

bot:ok to test

hppritcha · 2022-12-13T20:42:59Z

please make sure to test this PR with a singleton test

rhc54 · 2022-12-13T22:36:36Z

please make sure to test this PR with a singleton test

By definition, singleton's in OMPI v5 and above are not bound to anything as all the "self-bind at startup" code has been removed. It is therefore impossible to define a "distance" for them.

True, you could bind the singleton at some point - but that is likely a mistake. First, you have no idea what the user would want in terms of binding. Second, any binding that is done will be picked up by PRRTE if the singleton calls "comm_spawn" and used as a constraining window for all subsequent process starts. Probably not what anyone would really want.

rhc54 · 2022-12-13T23:03:35Z

@amirshehataornl @naughtont3

Let me answer your questions by starting with a little history. In my former life at Intel (back about 4-5 years ago), I setup a grant for Brice (of HWLOC fame) to work with us on topology discovery/description. We were developing rather complex multi-chip packages and seeing performance issues that suspiciously looked like they were topology related - i.e., the codes weren't properly understanding the MCP topology and mapping/binding in sub-optimal ways.

Long story short, the problem was that the codes were trying to operate on the basis of "NUMA" domains - yet the NUMA concept wasn't that clear in these systems. There was memory on each CPU die (with multiple of those in a package), additional memory inside the package (divided between the CPUs by the BIOS - and therefore alterable), memory on the PCIe bus, memory in the GPUs and fabric interfaces, etc - with the type of each of those memories being potentially different (DRAM, NVRAM, etc.). All of that memory was now potentially in the address space of the CPUs in the package - so when you asked to "bind to NUMA", precisely which NUMA(s) are you referring to? Far from clear.

This is why you won't find NUMA-to-NUMA distance matrices in the HWLOC topology any more, at least in most cases. Really not very useful. It's also why I keep nagging people over here to update their NIC selection logic and stop using NUMA - but that's another battle.

What we are doing in its place is providing every process with a complete distance map of its location to every device of interest on your local node. You can then do with it what you like. We also provide each process with the map for every other process in the job, so you can (for example) know that rank 17 is going to select NIC 3 on its node as that is the closest to it, and then decide to use your NIC 2 to connect to it. At some point in the near future, we plan to add switch info so you can see which NICs share a switch, and thus potentially optimize collectives.

PRRTE has been providing the distance info for at least 2 years now. It is currently "off" by default, governed by the PRRTE MCA parameter "pmix_generate_distances":

    (void) pmix_mca_base_var_register("prte", "pmix", NULL, "generate_distances",
                                      "Device types whose distances are to be provided (default=none, options=fabric,gpu,network",
                                      PMIX_MCA_BASE_VAR_TYPE_BOOL,
                                      &generate_dist);
    prte_pmix_server_globals.generate_dist = 0;
    if (NULL != generate_dist) {
        tmp = PMIX_ARGV_SPLIT_COMPAT(generate_dist, ',');
        for (i=0; NULL != tmp[i]; i++) {
            if (0 == strcasecmp(tmp[i], "fabric")) {
                prte_pmix_server_globals.generate_dist |= PMIX_DEVTYPE_OPENFABRICS;
            } else if (0 == strcasecmp(tmp[i], "gpu")) {
                prte_pmix_server_globals.generate_dist |= PMIX_DEVTYPE_GPU;
            } else if (0 == strcasecmp(tmp[i], "network")) {
                prte_pmix_server_globals.generate_dist |= PMIX_DEVTYPE_NETWORK;
            }
        }
    }

We can add other device types fairly easily. Note that Brice and I spent a fair amount of time coming up with a reasonable "distance" metric for computing these values. They are purely relative and not time measurements, so a distance of two means the device is twice as far from you as a device at distance of one. It should be somewhat correlated to time, but not rigorously so.

You can read more about all this in Section 11.4 of the PMIx Standard, which talks about the server-side for computing the distances. You retrieve them using PMIx_Get, passing the ID of the proc of interest with the PMIX_DEVICE_DISTANCES key. You'll get an array of distances, one for each device. I need to add the ability to "qualify" that request to only return the devices of interest.

Any HWLOC released in the last 4 years is fine. You'll also need PMIx v4.0 or above, and PRRTE v3.0 to provide the data.

HTH
Ralph

amirshehataornl · 2022-12-14T00:39:30Z

Thanks Ralph. This was very helpful.

rhc54 · 2022-12-14T01:26:54Z

If you experiment with this, please let me know how it works for you - the distance algorithm really hasn't been well tested yet for usefulness (i.e., how well can you optimize in various scenarios based on it). I'd be happy to point you to where it is done in PMIx so you can tinker and improve upon it.

amirshehataornl · 2022-12-14T14:25:41Z

@rhc54, will do. I'll definitely be trying it on our system. Pointers in the code would be helpful. I've been looking at PMIx_compute_distances(). Is that it?

rhc54 · 2022-12-14T14:41:05Z

I've been looking at PMIx_compute_distances(). Is that it?

Yep, that's the one!

amirshehataornl · 2023-01-11T01:42:04Z

@rhc54 happy new year. I'm back looking at this
In order to get the distances from the current processes to the NICs, you need to get the cpu binding of the process. PMIx provides PMIx_Get_cpuset(). This API requires pmix_globals.topology to have been initialized. pmix_globals.topology is initialized in PMIx_Init(). Looking at PMIx_Init(), it doesn't look like it initializes pmix_globals.topology unless you pass it an info structure, initialized with the topology. And in order to get the PMIx topology, you need to have initialized PMIx. There is a loop here, and not sure how it could be broken. I thought of calling PMIx_Init() twice, but the second time it exits early because it has already been initialized. Is there a way to initialize pmix_globals.topology independently? What am I missing?

I ended up initializing the cpuset manually, as the pmix_cpuset_t is just a thin wrapper on top of hwloc bitmap, test code below.

444 static void pmix_test(struct fi_pci_attr pci, hwloc_bitmap_t proc_cpuset)                         
445 {  
446 »···int rc;                                                                                       
447 »···pmix_topology_t *topo;                                                                        
448 »···pmix_proc_t this_proc;
449 »···pmix_cpuset_t cpuset;
450 »···pmix_info_t *info;                                                                            
451 »···size_t ninfo;
452 »···pmix_device_distance_t *distances;                                                            
453 »···size_t ndist;
454 »···pmix_device_type_t type = PMIX_DEVTYPE_OPENFABRICS |                                          
455 »···  PMIX_DEVTYPE_NETWORK |
456 »···  PMIX_DEVTYPE_COPROC |                                                                       
457 »···  PMIX_DEVTYPE_GPU |                                                                          
458 »···  PMIX_DEVTYPE_BLOCK |                                                                        
459 »···  PMIX_DEVTYPE_DMA;
460
461 »···PMIX_TOPOLOGY_CREATE(topo, 8);                                                                
462 »···/* TODO problem with PMIx_Get_cpuset() */                                                     
463
464 »···memset(&cpuset, 0, sizeof(cpuset));                                                           
465 »···cpuset.source = strdup("hwloc");
466 »···cpuset.bitmap = proc_cpuset;                                                                  
467
468 »···rc = PMIx_Load_topology(topo);
469 »···if (rc) {
470 »···»···fprintf(stderr, "failure loading pmix topology\n");                                       
471 »···»···return;
472 »···}
473
474 »···ninfo = 1;
475 »···PMIX_INFO_CREATE(info, ninfo);                                                                
476 »···PMIX_INFO_LOAD(&info[0], PMIX_DEVICE_TYPE, &type, PMIX_DEVTYPE);
477 »···rc = PMIx_Compute_distances(topo, &cpuset, info, ninfo, &distances,
478 »···»···»···»···»···»···»···»···&ndist);
479 »···PMIX_INFO_FREE(info, ninfo);
480 »···if (PMIX_SUCCESS != rc) {
481 »···»···fprintf(stderr, "failed to compute distances\n");                                         
482 »···»···return;
483 »···}
484 }

That worked. And I did get the distances but only from the process to the NICs. The results look correct based on our topology.

(gdb) p distances[0]
$4 = {uuid = 0x7fffc4014550 "ipv4://02:00:00:00:09:b3", osname = 0x7fffc4014580 "hsn2", type = 4, mindist = 6, maxdist = 6}
(gdb) p distances[1]
$5 = {uuid = 0x7fffc40145a0 "ipv4://00:40:a6:86:31:15", osname = 0x7fffc40145d0 "ens2", type = 4, mindist = 6, maxdist = 6}
(gdb) p distances[2]
$6 = {uuid = 0x7fffc40145f0 "ipv4://02:00:00:00:09:f3", osname = 0x7fffc4014620 "hsn1", type = 4, mindist = 7, maxdist = 7}
(gdb) p distances[3]
$7 = {uuid = 0x7fffc4014640 "ipv4://02:00:00:00:09:b2", osname = 0x7fffc4014670 "hsn3", type = 4, mindist = 7, maxdist = 7}
(gdb) p distances[4]
$8 = {uuid = 0x7fffc4014690 "ipv4://02:00:00:00:09:f2", osname = 0x7fffc40146c0 "hsn0", type = 4, mindist = 7, maxdist = 7}

However, should I be getting distances to other objects, like the GPUs for example seeing that I had that in the type bit field? I'll delve into this in more details, but any insight would be greatly appreciated.

The above test code works regardless if I have the below parameter set or not. Which is good. But doesn't align with what you mentioned, regarding it being turned off by default.

--prtemca pmix_generate_distances on

Another issue I ran into is if I do this:

--prtemca pmix_generate_distances 1

I get a segfault. I'll investigate and see if I can push up a fix

rhc54 · 2023-01-11T15:22:10Z

There is a loop here, and not sure how it could be broken.

I'm rather confused as to what you are trying to do. PRRTE computes the distances and includes them in the data provided to a process - all you have to do is access them via PMIx_Get.

If you are trying to compute the distances yourself at the application level, then yes, you do have to get your topology. One way to do it is just to use PMIx_Get and ask PMIx to provide you with a copy of the topology so you can use it.

However, should I be getting distances to other objects, like the GPUs for example seeing that I had that in the type bit field?

I honestly don't remember offhand where I left the implementation - could be that I didn't work thru the OR'd bit fields.

--prtemca pmix_generate_distances on

We won't recognize that parameter value - it's a boolean param, so the word "on" means nothing to it

--prtemca pmix_generate_distances 1 generates a segfault

That would be the more concerning issue as it means PRRTE is segfaulting when attempting to compute the distances.

amirshehataornl · 2023-01-11T16:43:09Z

@rhc54 , The problem is with PMIx_Get_cpuset(). It crashes when I call it. Should I be able to use that API?

The reason for the crash is:

PMIx_Get_cpuset()
 -> pmix_hwloc_get_cpuset()
   -> hwloc_get_cpubind(pmix_globals.topology.topology, cpuset->bitmap, flag);

pmix_globals.topology.topology == NULL

I call PMIx_Get_cpuset() from OMPI in the opal_common_ofi_select_provider() path.

I'm trying to understand how pmix_globals.topology.topology is suppose to be initialized.

rhc54 · 2023-01-11T16:56:44Z

Hmmm...interesting dilemma! Technically, yes you should be able to use that API. At a more practical level, I've only ever used it on the server side of things as that is where we compute distances and such.

Looking at the code in pmix_client_topology.c and pmix_hwloc.c, I can see where the bug lies. Let me post a fix for that part of the problem.

amirshehataornl · 2023-01-11T17:32:31Z

@rhc54, thanks.

And I think I'm starting to see the intended workflow.
The server side calls PMIx_Compute_distances() from prte_pmix_server_register_nspace() if the pmix_generate_distances is set.
Then from the client side you use PMIx_Get with PMIX_DEVICE_DISTANCES to grab the distances.

However, since the APIs are there and should be usable from the client as well, it seems like calling PMIx_Compute_distances() directly would avoid us having to rely on configuration, which may or may not be set. Thoughts?

rhc54 · 2023-01-11T17:37:31Z

This (openpmix/openpmix#2907) should fix the dilemma.

PMIx_Compute_distances() directly would avoid us having to rely on configuration, which may or may not be set. Thoughts?

There are tradeoffs, of course. You gain portability by doing it yourself as you don't know if PRRTE is around or not, and maybe the host environment isn't as kind as PRRTE. Only real negative is that you can't easily generate the distances for the remote side.

Here is what I would recommend. Try to perform a PMIx_Get for what you need. If it doesn't succeed, then go ahead and compute it yourself. The "get" call is pretty cheap once the new shmem subsystem is enabled, so it is probably worth trying.

Meantime, if you are finding this useful, then we should turn on the generate distances feature by default, and ensure that we are computing the distances for the devices of interest. Might encourage others to use them as well 😄

amirshehataornl · 2023-01-11T18:22:04Z

@rhc54, I have another more general topology question.
PMIx is able to gather the topology of processes on all nodes? or just on the local node?
Let's say my job has two nodes, A and B and each node has 1 process running on it. Can the process running on A find the topology of the process running on B?

rhc54 · 2023-01-11T18:28:33Z

@rhc54, I have another more general topology question. PMIx is able to gather the topology of processes on all nodes? or just on the local node? Let's say my job has two nodes, A and B and each node has 1 process running on it. Can the process running on A find the topology of the process running on B?

By topology, do you mean the topology of the node? Or do you mean the binding of the process running on B?

The answer to the latter is "yes" - we provide you with the binding/cpuset of all procs in the job.

If you want the node's topology, that's more difficult to get. It is unlikely that the runtime will share full topologies around the job. However, PRRTE does do this, and so in that case it should be possible to retrieve it. I'd have to look to see the semantics for such a request.

amirshehataornl · 2023-01-11T20:23:02Z

@rhc54 found the cause of the crash when specifying generate_dist variable:
openpmix/prrte#1637

rhc54 · 2023-01-11T20:34:42Z

Hooray! Thanks!

rhc54 · 2023-01-11T21:09:36Z

Here is what I would recommend. Try to perform a PMIx_Get for what you need. If it doesn't succeed, then go ahead and compute it yourself. The "get" call is pretty cheap once the new shmem subsystem is enabled, so it is probably worth trying.

BTW: when attempting this "get", be sure to pass an "optional" directive to it. This will instruct PMIx_Get to just check its local cache for the value and return if not found. Otherwise, it will poke the local server to try and retrieve it, and that gets more expensive. Looks like this:

pmix_info_t directive;

PMIX_INFO_LOAD(&directive, PMIX_OPTIONAL, NULL, PMIX_BOOL);
PMIx_Get(proc, key, &directive, 1, &value);
PMIX_INFO_DESTRUCT(&directive);

amirshehataornl · 2023-01-11T21:26:08Z

@rhc54, going back to the segfault
I'm still running into a problem.
when calling

PMIX_INFO_LIST_ADD(ret, pmap, PMIX_DEVICE_DISTANCES, &darray, PMIX_DATA_ARRAY)

in pmix_server_register_fns.c:prte_pmix_server_register_nspace()

darray is populated as follows

darray.array = distances;
darray.size = ndist;

It's missing the darray.type. This causes a segfault later on.

I tried setting darray.type = PMIX_INFO.

But I don't think that's right either.

It looks like it ought to be set to PMIX_POINTER.

But would like confirmation.

rhc54 · 2023-01-11T21:36:01Z

It should be PMIX_DEVICE_DIST

rhc54 · 2023-01-11T21:38:19Z

Note that the "info_list_add" code is going to make a copy of that array, and so there may be some issue in that code path. The array should consist of ndist number of pmix_device_distance_t structures.

amirshehataornl · 2023-01-11T22:50:35Z

@rhc54 fixed the other crash I ran into:
openpmix/prrte#1638

Also I'm confirming that PMIx_Get(distance) works correctly. It succeeds if the generate_distances is turned on, otherwise it fails.

I'd like to turn on distances by default. Are you okay with that? Basically the default value will be "fabric,gpu,network"

rhc54 · 2023-01-11T23:27:51Z

I'd like to turn on distances by default. Are you okay with that? Basically the default value will be "fabric,gpu,network"

Absolutely - thanks!

amirshehataornl · 2023-01-12T19:31:18Z

@rhc54, turned on distance generation by default: openpmix/prrte#1639
Next step is update this patch to use distances instead of NUMA binding to determine nearest interface.

rhc54 · 2023-01-13T19:24:10Z

Afraid I haven't seen their entry before - perhaps it would help if you provided the full entry for this device? Sounds like they didn't follow convention.

bgoglin · 2023-01-13T19:51:25Z

@bgoglin, @rhc54, thanks for the pointers. I'm gdbing through these structures and the osdev doesn't have a node "NodeGUID" infos entry, just the "Address" one. Is that expected? Should we be parsing out the NodeGUID and the SystemImageGUID from the address?
infos = 0x7fffc41cb4c0,
infos_count = 1,

(gdb) p *obj->io_first_child->infos
$12 = { 
  name = 0x7fffc41cb550 "Address",
  value = 0x7fffc41cb570 "02:00:00:00:12:e2"

Can you post the part of lstopo -v that contains this OS device, its parent PCI dev, and the other OS devices below it?

amirshehataornl · 2023-01-13T20:10:08Z

        HostBridge L#2 (buses=0000:[d4-d6])
          PCIBridge L#3 (busid=0000:d4:01.1 id=1022:14be class=0604(PCIBridge) link=31.51GB/s buses=0000:[d5-d5])
            PCI L#1 (busid=0000:d5:00.0 id=17db:0501 class=0200(Ethernet) link=31.51GB/s)
              Network L#0 (Address=02:00:00:00:12:a3) "hsn2"
          PCIBridge L#4 (busid=0000:d4:07.1 id=1022:14bf class=0604(PCIBridge) link=31.51GB/s buses=0000:[d6-d6])
            PCI L#2 (busid=0000:d6:00.0 id=1002:7408 class=0380(Display) link=31.51GB/s)

amirshehataornl · 2023-01-13T20:15:26Z

if NodeGUID is another infos should it appear in the Network L#0 (Address=...) "hsn2" line?
I also did lstopo -v on my local machine and I don't see except the Address entry in the Network line. But not sure if that's just cause that's the only thing which is printed

rhc54 · 2023-01-13T20:17:35Z

Hmmm...I wonder if the issue here is that they are identified as an "Ethernet" device instead of an "OpenFabrics" device? An Ethernet device wouldn't have GUIDs associated with it - just an IPv4/6 address. I'll bet that is the problem.

What are you seeing in the device_dist struct? I'm wondering if we just aren't constructing the struct info correctly, or perhaps misidentifying the device type.

bgoglin · 2023-01-13T20:18:37Z

Oh, wait, this is Cray device? Cray has never answered any of my request for information or access to hardware. Hence I don't know where to get Cray-specific information such as GUID, etc. So there's no hwloc support at all, you only see basic Network (Ethernet) information.

hppritcha · 2023-01-13T20:19:08Z

the slingshot11 nics are basically ethernet cards with some bells and whistles.

rhc54 · 2023-01-13T20:26:59Z

Perhaps someone can provide Brice with access to a machine so he can try to improve the HWLOC support? I know I tried to get it for him back in my Intel days, but left before we ever say a device.

bgoglin · 2023-01-13T20:28:44Z

Can you post the output of a libfabric tool that shows device attributes, addresses, etc for that hsn device? I'd like to get an idea of what hwloc could expose.

hppritcha · 2023-01-13T20:31:41Z

contacting Brice about this option offline

amirshehataornl · 2023-01-13T20:32:45Z

(gdb) p current_provider->nic
$1 = (struct fid_nic *) 0xcfed00
(gdb) p *current_provider->nic
$2 = {
  fid = {
    fclass = 20,
    context = 0x0,
    ops = 0x7fffe2724c80 <default_nic_ops>
  },
  device_attr = 0xcfed40,
  bus_attr = 0xcfee20,
  link_attr = 0xcfee40,
  prov_attr = 0x0
}
(gdb) p *current_provider->nic->device_attr
$3 = {
  name = 0xcfed80 "cxi0",
  device_id = 0xcfeda0 "0x501",
  device_version = 0xcfedc0 "2",
  vendor_id = 0xcfede0 "0x17db",
  driver = 0xcfee00 "cxi_core",
  firmware = 0x0
}
(gdb) p *current_provider->nic->bus_attr
$4 = {
  bus_type = FI_BUS_PCI,
  attr = {
    pci = {
      domain_id = 0,
      bus_id = 197 '\305',
      device_id = 0 '\000',
      function_id = 0 '\000'
    }
  }
}
(gdb) p *current_provider->nic->link_attr
$5 = {
  address = 0xcfee70 "0x12e2",
  mtu = 2112,
  speed = 200000000,
  state = FI_LINK_UP,
  network_type = 0xcfee90 "HPC Ethernet"
}

amirshehataornl · 2023-01-13T20:33:24Z

contacting Brice about this option offline

which option?

hppritcha · 2023-01-13T20:34:23Z

getting him an account on a box with ss11 nics

bgoglin · 2023-01-13T20:36:13Z

Looks like there isn't much useful in the libfabric output. The address doesn't even look similar to what hwloc is seeing (maybe GDB doesn't know how to display it, is it a variable length array?)

amirshehataornl · 2023-01-13T20:45:46Z

The reason why I'm going through this exercise at the moment is to be able to convert the current process/NIC binding code in OMPI to use distances. And make the code generic (not specific to the frontier platform).
Basically, with what's available at the moment, here are my thoughts:

Get the device distances from current process
Identify the device(s) nearest the process. Could be more than 1 if the distances are the same
Find the matching fi_info blocks:
3a. Go through the fi_info block list returned by fi_getinfo() and use the pci information to look up the pci object
3b. Iterate through the osdevs related to the pci object (which could be multiple of them) and compare the address, with the address in the distance structure. The assumption here is if there are multiple os objects, each one will have a unique address.
3c. matching fi_info is added to a list
if there is only one fi_info in the list, then select that one
otherwise use the process rank to select one.

Sound reasonable?

amirshehataornl · 2023-01-13T20:54:29Z

Looks like there isn't much useful in the libfabric output. The address doesn't even look similar to what hwloc is seeing (maybe GDB doesn't know how to display it, is it a variable length array?)

struct fi_link_attr {
	char			*address;
	size_t			mtu;
	size_t			speed;
	enum fi_link_state	state;
	char			*network_type;
};

struct fi_bus_attr {
	enum fi_bus_type	bus_type;
	union {
		struct fi_pci_attr	pci;
	} attr;
};

struct fi_device_attr {
	char			*name;
	char			*device_id;
	char			*device_version;
	char			*vendor_id;
	char			*driver;
	char			*firmware;
};

struct fid_nic {
	struct fid		fid;
	struct fi_device_attr	*device_attr;
	struct fi_bus_attr	*bus_attr;
	struct fi_link_attr	*link_attr;
	void			*prov_attr;
};

Address is written as follows:

	ret = asprintf(&fi->nic->link_attr->address, "0x%x",
		       nic_if->info->nic_addr);

So I don't think that gdb is mis-interpreting it

rhc54 · 2023-01-13T21:27:48Z

I suspect the problem is that libfabric is obtaining the info from HWLOC, which has no real access (at this point) to the Cray information because they haven't shared it. I doubt we can resolve how to use device_dist until HWLOC can support this hardware, so @hppritcha is likely taking the right path.

rhc54 · 2023-01-13T21:29:38Z

@amirshehataornl One thing you could investigate is if PMIx is correctly identifying this device as a "network" vs "openfabric" device. If so, then it should be providing the IP address as the UUID. Of course, that won't help if the address being reported by HWLOC differs from that you are getting from the Cray OFI provider.

Maybe you should look at that provider and see where it is getting the address? Could help explain the difference.

amirshehataornl · 2023-01-14T01:54:33Z

@rhc54, @bgoglin I updated the patch, can you guys take a look and let me know what you think of this approach.

rhc54

Overall, I think that should work - but only for the Cray and other Ethernet providers. I realize that covers the range of your interest, but is this going to cause a problem for OpenFabric providers?

rhc54 · 2023-01-14T02:37:17Z

opal/mca/common/ofi/common_ofi.c

+    distances = (pmix_device_distance_t*)dptr->array;
+
+	for (i = 0; i < dptr->size; i++)
+		fprintf(stderr, "%d: %d:%s:%d:%d\n", getpid(), i, distances[i].uuid,


Shame on me - I missed the pmix_device_distances_t struct when providing "pretty-print" functions. I'll address that soon.

Can you expand why this wouldn't work for other HW?

rhc54 · 2023-01-14T02:39:19Z

opal/mca/common/ofi/common_ofi.c

+				+ strlen(distances[i].uuid))
+				continue;
+			if (!strcmp(addr+3, address)) {
+				fprintf(stderr, "%d matched distance addr %s with %s\n",


If the Cray OFI provider's IP address matches that reported by HWLOC, then this should work (or so I expect).

sorry, I'm going to remove the fprintfs. These slipped in from my debugging

rhc54 · 2023-01-14T14:25:45Z

Can you expand why this wouldn't work for other HW?

The basic approach would work, just not the mechanics you have here. The problem lies in the precise attributes provided for each class of device. Ethernet devices have IP addresses associated with them - since the Cray SS11 is identified in that class, you can compare the PMIx UUID with the "address" attribute. These devices are classed as network devices.

However, Infiniband devices are classed as OpenFabric devices. If they have been configured to support IP-over-IB, then they will have a port that is shown as a network device and has an IP address - but that isn't so common. They will also have ports identified as OpenFabric devices - this is the IB network connection. These ports will have NodeGUID and SystemGUID attributes (not address attributes), and so you would have to compare the PMIx UUID with those attributes.

You'll find the same to be true of GPUs and other device types - see the PMIx device distance code for more details:

                if (HWLOC_OBJ_OSDEV_NETWORK == table[n].hwtype) {
                    char *addr = NULL;
                    /* find the address */
                    for (i = 0; i < device->infos_count; i++) {
                        if (0 == strcasecmp(device->infos[i].name, "Address")) {
                            addr = device->infos[i].value;
                            break;
                        }
                    }
                    if (NULL == addr) {
                        /* couldn't find an address - report it as an error */
                        PMIX_LIST_DESTRUCT(&dists);
                        return PMIX_ERROR;
                    }
                    /* could be IPv4 or IPv6 */
                    cnt = countcolons(addr);
                    if (5 == cnt) {
                        pmix_asprintf(&d->dist.uuid, "ipv4://%s", addr);
                    } else if (19 == cnt) {
                        pmix_asprintf(&d->dist.uuid, "ipv6://%s", addr);
                    } else {
                        /* unknown address type */
                        PMIX_LIST_DESTRUCT(&dists);
                        return PMIX_ERROR;
                    }
                } else if (HWLOC_OBJ_OSDEV_OPENFABRICS == table[n].hwtype) {
                    char *ngid = NULL;
                    char *sgid = NULL;
                    /* find the UIDs */
                    for (i = 0; i < device->infos_count; i++) {
                        if (0 == strcasecmp(device->infos[i].name, "NodeGUID")) {
                            ngid = device->infos[i].value;
                        } else if (0 == strcasecmp(device->infos[i].name, "SysImageGUID")) {
                            sgid = device->infos[i].value;
                        }
                    }
                    if (NULL == ngid || NULL == sgid) {
                        PMIX_LIST_DESTRUCT(&dists);
                        return PMIX_ERROR;
                    }
                    pmix_asprintf(&d->dist.uuid, "fab://%s::%s", ngid, sgid);
                } else if (HWLOC_OBJ_OSDEV_GPU == table[n].hwtype) {
                    /* if the name starts with "card", then this is just the aux card of the GPU */
                    if (0 == strncasecmp(device->name, "card", 4)) {
                        pmix_list_remove_item(&dists, &d->super);
                        PMIX_RELEASE(d);
                        device = hwloc_get_next_osdev(topo->topology, device);
                        continue;
                    }
                    pmix_asprintf(&d->dist.uuid, "gpu://%s::%s", pmix_globals.hostname,
                                  device->name);
                } else {
                    /* unknown type */
                    pmix_list_remove_item(&dists, &d->super);
                    PMIX_RELEASE(d);
                    device = hwloc_get_next_osdev(topo->topology, device);
                    continue;
                }

amirshehataornl · 2023-01-14T17:43:53Z

Makes sense. Keeping in mind that this function is specifically targeting network interfaces; IE find the nearest networking/openfabric device to the process, then it should be enough to check the type of the device and basically do the logic in the HWLOC_OBJ_OSDEV_OPENFABRICS and HWLOC_OBJ_OSDEV_NETWORK.
HWLOC_OBJ_OSDEV_OPENFABRICS: check NodeGUID and SysImageGUID
HWLOC_OBJ_OSDEV_NETWORK: check Address

We don't need to handle GPUs in this function.
Do you agree this would be sufficient?

My goal is to land this patch in OMPI, so we don't have to track it separately.

rhc54 · 2023-01-14T17:52:00Z

Yes - if you handle the two cases, you should be fine.

The existing code in compare_cpusets assumed that some non_io ancestor of a PCI object should intersect with the cpuset of the proc. However, this is not true. There is a case where the non IO ancestor can be an L3. If there exists two L3s on the same NUMA and the process is bound to one L3, but the PCI object is connected to the other L3, then compare_cpusets() will return false. A better way to determine the optimal interface is by finding the distances of the interfaces from the current process. Then find out which of these interfaces is nearest the process and select it. Use the PMIx distance generation for this purpose. Signed-off-by: Amir Shehata <[email protected]>

amirshehataornl · 2023-01-20T20:37:20Z

@rhc54, @bgoglin, would be able to take a look at this patch and let me know if you're good with it, or if you'd like to see further changes?

bgoglin · 2023-01-20T22:38:20Z

I don't see anything wrong but I don't know well what PMIx and libfabric report here, hence it's impossible to be sure of what's going to happen during your tests.

rhc54 · 2023-01-21T15:40:30Z

I'm good with it - I believe it will do what you seek. Will obviously need exposure to various environments to be sure.

rhc54 · 2023-01-21T15:42:44Z

FWIW: the error is due to Java failures to pull the repos. I believe there has been a change to the CI since this PR was originally created, so you may need to do a rebase against main to get the right CI triggers to fire.

github-actions bot added the Target: main label Dec 13, 2022

rhc54 reviewed Jan 14, 2023

View reviewed changes

amirshehataornl closed this by deleting the head repository Jan 24, 2023

amirshehataornl mentioned this pull request Jan 24, 2023

mtl/ofi: NIC selection update #11338

Merged

mtl/ofi: NIC selection update #11206

mtl/ofi: NIC selection update #11206

Uh oh!

Conversation

amirshehataornl commented Dec 13, 2022

Uh oh!

ompiteam-bot commented Dec 13, 2022

Uh oh!

rhc54 commented Dec 13, 2022

Uh oh!

amirshehataornl commented Dec 13, 2022

Uh oh!

naughtont3 commented Dec 13, 2022

Uh oh!

hppritcha commented Dec 13, 2022

Uh oh!

hppritcha commented Dec 13, 2022

Uh oh!

rhc54 commented Dec 13, 2022

Uh oh!

rhc54 commented Dec 13, 2022

Uh oh!

amirshehataornl commented Dec 14, 2022

Uh oh!

rhc54 commented Dec 14, 2022

Uh oh!

amirshehataornl commented Dec 14, 2022

Uh oh!

rhc54 commented Dec 14, 2022

Uh oh!

amirshehataornl commented Jan 11, 2023

Uh oh!

rhc54 commented Jan 11, 2023

Uh oh!

amirshehataornl commented Jan 11, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rhc54 commented Jan 11, 2023

Uh oh!

amirshehataornl commented Jan 11, 2023

Uh oh!

rhc54 commented Jan 11, 2023

Uh oh!

amirshehataornl commented Jan 11, 2023

Uh oh!

rhc54 commented Jan 11, 2023

Uh oh!

amirshehataornl commented Jan 11, 2023

Uh oh!

rhc54 commented Jan 11, 2023

Uh oh!

rhc54 commented Jan 11, 2023

Uh oh!

amirshehataornl commented Jan 11, 2023

Uh oh!

rhc54 commented Jan 11, 2023

Uh oh!

rhc54 commented Jan 11, 2023

Uh oh!

amirshehataornl commented Jan 11, 2023

Uh oh!

rhc54 commented Jan 11, 2023

Uh oh!

amirshehataornl commented Jan 12, 2023

Uh oh!

rhc54 commented Jan 13, 2023

Uh oh!

bgoglin commented Jan 13, 2023

Uh oh!

amirshehataornl commented Jan 13, 2023

Uh oh!

amirshehataornl commented Jan 13, 2023

Uh oh!

rhc54 commented Jan 13, 2023

Uh oh!

bgoglin commented Jan 13, 2023

Uh oh!

hppritcha commented Jan 13, 2023

Uh oh!

rhc54 commented Jan 13, 2023

amirshehataornl commented Jan 11, 2023 •

edited

Loading

amirshehataornl commented Jan 14, 2023 •

edited

Loading