
Conversation

@amirshehataornl
Contributor

The existing code in compare_cpusets() assumed that some non-IO ancestor of a PCI object should intersect with the cpuset of the proc. However, this is not true. There is a case where the non-IO ancestor can be an L3 cache. If there exist two L3s on the same NUMA node and the process is bound to one L3, but the PCI object is connected to the other L3, then compare_cpusets() will return false.

A better way of determining if the PCI object matches a process CPU set is to use the NUMA node as the common denominator.

Find all NUMA nodes on the system. Find which one intersects with the process's cpuset, then determine whether this NUMA node intersects with the non-IO ancestor's node list. If both conditions are true, then the PCI device matches the process.
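
A minimal sketch of that check with hwloc (illustrative only; the names topo, pci_obj and proc_cpuset are assumptions, not code from the actual patch; assumes hwloc.h and stdbool.h):

    static bool pci_matches_proc(hwloc_topology_t topo, hwloc_obj_t pci_obj,
                                 hwloc_cpuset_t proc_cpuset)
    {
        hwloc_obj_t numa = NULL;
        hwloc_obj_t ancestor = hwloc_get_non_io_ancestor_obj(topo, pci_obj);

        if (NULL == ancestor) {
            return false;
        }
        /* the NUMA node covering the process must also appear in the
         * non-IO ancestor's node list */
        while (NULL != (numa = hwloc_get_next_obj_by_type(topo, HWLOC_OBJ_NUMANODE,
                                                          numa))) {
            if (hwloc_bitmap_intersects(numa->cpuset, proc_cpuset) &&
                hwloc_bitmap_intersects(numa->nodeset, ancestor->nodeset)) {
                return true;
            }
        }
        return false;
    }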

Another change this patch brings is using HWLOC_CPUBIND_THREAD instead of HWLOC_CPUBIND_PROCESS when finding the cpuset of the process. There are cases where a process is initially bound to a set of CPUs, but during initialization the process can spawn more threads (some of which can be spawned by third-party libraries). These threads can be explicitly affined differently from the initial binding.

One example of that is the HSA library. It spawns a thread upon initialization and explicitly binds it to all available CPUs. If we use HWLOC_CPUBIND_PROCESS, this will result in a less-than-optimal process-to-NIC binding.

It is safer to make the assumption that the thread which is currently running this code has the correct binding and base the algorithm off that.
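
A minimal sketch of the thread-level query, assuming topo is an already-loaded hwloc topology:

    hwloc_cpuset_t cpuset = hwloc_bitmap_alloc();

    /* binding of the calling thread, not the union across all threads
     * in the process */
    if (0 != hwloc_get_cpubind(topo, cpuset, HWLOC_CPUBIND_THREAD)) {
        /* handle the error, e.g. fall back to the full online cpuset */
    }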

Signed-off-by: Amir Shehata [email protected]

@ompiteam-bot

Can one of the admins verify this patch?

@rhc54
Contributor

rhc54 commented Dec 13, 2022

I would caution that NUMA is no longer a well-defined entity in today's multi-chip packages - you may find that this proposed change behaves in an unexpected manner on other architectures.

@amirshehataornl
Contributor Author

I would caution that NUMA is no longer a well-defined entity in today's multi-chip packages - you may find that this proposed change behaves in an unexpected manner on other architectures.

@rhc54, the other option is to use the PMIx distance table you pointed me to in the comment on the issue I opened. Do you think that would be more reliable? Also, if you can give me some initial pointers to where I can see how that is used, that would be helpful. Thanks.

@naughtont3
Contributor

@rhc54 Also, do you recommend a minimum hwloc version that we should use?

@hppritcha
Member

bot:ok to test

@hppritcha
Member

please make sure to test this PR with a singleton test

@rhc54
Contributor

rhc54 commented Dec 13, 2022

please make sure to test this PR with a singleton test

By definition, singletons in OMPI v5 and above are not bound to anything, as all the "self-bind at startup" code has been removed. It is therefore impossible to define a "distance" for them.

True, you could bind the singleton at some point - but that is likely a mistake. First, you have no idea what the user would want in terms of binding. Second, any binding that is done will be picked up by PRRTE if the singleton calls "comm_spawn" and used as a constraining window for all subsequent process starts. Probably not what anyone would really want.

@rhc54
Contributor

rhc54 commented Dec 13, 2022

@amirshehataornl @naughtont3

Let me answer your questions by starting with a little history. In my former life at Intel (back about 4-5 years ago), I set up a grant for Brice (of HWLOC fame) to work with us on topology discovery/description. We were developing rather complex multi-chip packages and seeing performance issues that suspiciously looked like they were topology related - i.e., the codes weren't properly understanding the MCP topology and were mapping/binding in sub-optimal ways.

Long story short, the problem was that the codes were trying to operate on the basis of "NUMA" domains - yet the NUMA concept wasn't that clear in these systems. There was memory on each CPU die (with multiple of those in a package), additional memory inside the package (divided between the CPUs by the BIOS - and therefore alterable), memory on the PCIe bus, memory in the GPUs and fabric interfaces, etc - with the type of each of those memories being potentially different (DRAM, NVRAM, etc.). All of that memory was now potentially in the address space of the CPUs in the package - so when you asked to "bind to NUMA", precisely which NUMA(s) are you referring to? Far from clear.

This is why you won't find NUMA-to-NUMA distance matrices in the HWLOC topology any more, at least in most cases. Really not very useful. It's also why I keep nagging people over here to update their NIC selection logic and stop using NUMA - but that's another battle.

What we are doing in its place is providing every process with a complete distance map of its location to every device of interest on your local node. You can then do with it what you like. We also provide each process with the map for every other process in the job, so you can (for example) know that rank 17 is going to select NIC 3 on its node as that is the closest to it, and then decide to use your NIC 2 to connect to it. At some point in the near future, we plan to add switch info so you can see which NICs share a switch, and thus potentially optimize collectives.

PRRTE has been providing the distance info for at least 2 years now. It is currently "off" by default, governed by the PRRTE MCA parameter "pmix_generate_distances":

    (void) pmix_mca_base_var_register("prte", "pmix", NULL, "generate_distances",
                                      "Device types whose distances are to be provided (default=none, options=fabric,gpu,network",
                                      PMIX_MCA_BASE_VAR_TYPE_BOOL,
                                      &generate_dist);
    prte_pmix_server_globals.generate_dist = 0;
    if (NULL != generate_dist) {
        tmp = PMIX_ARGV_SPLIT_COMPAT(generate_dist, ',');
        for (i=0; NULL != tmp[i]; i++) {
            if (0 == strcasecmp(tmp[i], "fabric")) {
                prte_pmix_server_globals.generate_dist |= PMIX_DEVTYPE_OPENFABRICS;
            } else if (0 == strcasecmp(tmp[i], "gpu")) {
                prte_pmix_server_globals.generate_dist |= PMIX_DEVTYPE_GPU;
            } else if (0 == strcasecmp(tmp[i], "network")) {
                prte_pmix_server_globals.generate_dist |= PMIX_DEVTYPE_NETWORK;
            }
        }
    }

We can add other device types fairly easily. Note that Brice and I spent a fair amount of time coming up with a reasonable "distance" metric for computing these values. They are purely relative and not time measurements, so a distance of two means the device is twice as far from you as a device at distance of one. It should be somewhat correlated to time, but not rigorously so.

You can read more about all this in Section 11.4 of the PMIx Standard, which talks about the server-side for computing the distances. You retrieve them using PMIx_Get, passing the ID of the proc of interest with the PMIX_DEVICE_DISTANCES key. You'll get an array of distances, one for each device. I need to add the ability to "qualify" that request to only return the devices of interest.
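
A rough sketch of that retrieval on the client side (assuming PMIx_Init() has already succeeded and myproc identifies the calling process):

    pmix_value_t *val = NULL;
    pmix_device_distance_t *dists;
    size_t n, ndist;

    if (PMIX_SUCCESS == PMIx_Get(&myproc, PMIX_DEVICE_DISTANCES, NULL, 0, &val) &&
        PMIX_DATA_ARRAY == val->type) {
        dists = (pmix_device_distance_t*)val->data.darray->array;
        ndist = val->data.darray->size;
        for (n = 0; n < ndist; n++) {
            fprintf(stderr, "%s (%s): mindist %u maxdist %u\n",
                    dists[n].osname, dists[n].uuid,
                    dists[n].mindist, dists[n].maxdist);
        }
        PMIX_VALUE_RELEASE(val);
    }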

Any HWLOC released in the last 4 years is fine. You'll also need PMIx v4.0 or above, and PRRTE v3.0 to provide the data.

HTH
Ralph

@amirshehataornl
Contributor Author

Thanks Ralph. This was very helpful.

@rhc54
Contributor

rhc54 commented Dec 14, 2022

If you experiment with this, please let me know how it works for you - the distance algorithm really hasn't been well tested yet for usefulness (i.e., how well can you optimize in various scenarios based on it). I'd be happy to point you to where it is done in PMIx so you can tinker and improve upon it.

@amirshehataornl
Contributor Author

@rhc54, will do. I'll definitely be trying it on our system. Pointers in the code would be helpful. I've been looking at PMIx_Compute_distances(). Is that it?

@rhc54
Contributor

rhc54 commented Dec 14, 2022

I've been looking at PMIx_Compute_distances(). Is that it?

Yep, that's the one!

@amirshehataornl
Contributor Author

@rhc54, happy new year. I'm back looking at this.
In order to get the distances from the current process to the NICs, you need to get the cpu binding of the process. PMIx provides PMIx_Get_cpuset(). This API requires pmix_globals.topology to have been initialized, which happens in PMIx_Init(). Looking at PMIx_Init(), it doesn't look like it initializes pmix_globals.topology unless you pass it an info structure initialized with the topology. And in order to get the PMIx topology, you need to have initialized PMIx. There is a loop here, and I'm not sure how it could be broken. I thought of calling PMIx_Init() twice, but the second time it exits early because it has already been initialized. Is there a way to initialize pmix_globals.topology independently? What am I missing?

I ended up initializing the cpuset manually, since pmix_cpuset_t is just a thin wrapper on top of an hwloc bitmap; test code below.

static void pmix_test(struct fi_pci_attr pci, hwloc_bitmap_t proc_cpuset)
{
    int rc;
    pmix_topology_t *topo;
    pmix_proc_t this_proc;
    pmix_cpuset_t cpuset;
    pmix_info_t *info;
    size_t ninfo;
    pmix_device_distance_t *distances;
    size_t ndist;
    pmix_device_type_t type = PMIX_DEVTYPE_OPENFABRICS |
                              PMIX_DEVTYPE_NETWORK |
                              PMIX_DEVTYPE_COPROC |
                              PMIX_DEVTYPE_GPU |
                              PMIX_DEVTYPE_BLOCK |
                              PMIX_DEVTYPE_DMA;

    PMIX_TOPOLOGY_CREATE(topo, 8);
    /* TODO problem with PMIx_Get_cpuset() */

    memset(&cpuset, 0, sizeof(cpuset));
    cpuset.source = strdup("hwloc");
    cpuset.bitmap = proc_cpuset;

    rc = PMIx_Load_topology(topo);
    if (rc) {
        fprintf(stderr, "failure loading pmix topology\n");
        return;
    }

    ninfo = 1;
    PMIX_INFO_CREATE(info, ninfo);
    PMIX_INFO_LOAD(&info[0], PMIX_DEVICE_TYPE, &type, PMIX_DEVTYPE);
    rc = PMIx_Compute_distances(topo, &cpuset, info, ninfo, &distances,
                                &ndist);
    PMIX_INFO_FREE(info, ninfo);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "failed to compute distances\n");
        return;
    }
}

That worked. And I did get the distances but only from the process to the NICs. The results look correct based on our topology.

(gdb) p distances[0]
$4 = {uuid = 0x7fffc4014550 "ipv4://02:00:00:00:09:b3", osname = 0x7fffc4014580 "hsn2", type = 4, mindist = 6, maxdist = 6}
(gdb) p distances[1]
$5 = {uuid = 0x7fffc40145a0 "ipv4://00:40:a6:86:31:15", osname = 0x7fffc40145d0 "ens2", type = 4, mindist = 6, maxdist = 6}
(gdb) p distances[2]
$6 = {uuid = 0x7fffc40145f0 "ipv4://02:00:00:00:09:f3", osname = 0x7fffc4014620 "hsn1", type = 4, mindist = 7, maxdist = 7}
(gdb) p distances[3]
$7 = {uuid = 0x7fffc4014640 "ipv4://02:00:00:00:09:b2", osname = 0x7fffc4014670 "hsn3", type = 4, mindist = 7, maxdist = 7}
(gdb) p distances[4]
$8 = {uuid = 0x7fffc4014690 "ipv4://02:00:00:00:09:f2", osname = 0x7fffc40146c0 "hsn0", type = 4, mindist = 7, maxdist = 7}

However, should I be getting distances to other objects, like the GPUs for example, seeing that I had that in the type bit field? I'll delve into this in more detail, but any insight would be greatly appreciated.

The above test code works regardless of whether I have the below parameter set or not, which is good, but doesn't align with what you mentioned about it being turned off by default.

--prtemca pmix_generate_distances on

Another issue I ran into is if I do this:

--prtemca pmix_generate_distances 1

I get a segfault. I'll investigate and see if I can push up a fix

@rhc54
Contributor

rhc54 commented Jan 11, 2023

There is a loop here, and not sure how it could be broken.

I'm rather confused as to what you are trying to do. PRRTE computes the distances and includes them in the data provided to a process - all you have to do is access them via PMIx_Get.

If you are trying to compute the distances yourself at the application level, then yes, you do have to get your topology. One way to do it is just to use PMIx_Get and ask PMIx to provide you with a copy of the topology so you can use it.

However, should I be getting distances to other objects, like the GPUs for example seeing that I had that in the type bit field?

I honestly don't remember offhand where I left the implementation - could be that I didn't work thru the OR'd bit fields.

--prtemca pmix_generate_distances on

We won't recognize that parameter value - it's a boolean param, so the word "on" means nothing to it

--prtemca pmix_generate_distances 1 generates a segfault

That would be the more concerning issue as it means PRRTE is segfaulting when attempting to compute the distances.

@amirshehataornl
Contributor Author

amirshehataornl commented Jan 11, 2023

@rhc54, the problem is with PMIx_Get_cpuset(). It crashes when I call it. Should I be able to use that API?

The reason for the crash is:

PMIx_Get_cpuset()
 -> pmix_hwloc_get_cpuset()
   -> hwloc_get_cpubind(pmix_globals.topology.topology, cpuset->bitmap, flag);

pmix_globals.topology.topology == NULL

I call PMIx_Get_cpuset() from OMPI in the opal_common_ofi_select_provider() path.

I'm trying to understand how pmix_globals.topology.topology is supposed to be initialized.

@rhc54
Contributor

rhc54 commented Jan 11, 2023

Hmmm...interesting dilemma! Technically, yes you should be able to use that API. At a more practical level, I've only ever used it on the server side of things as that is where we compute distances and such.

Looking at the code in pmix_client_topology.c and pmix_hwloc.c, I can see where the bug lies. Let me post a fix for that part of the problem.

@amirshehataornl
Contributor Author

@rhc54, thanks.

And I think I'm starting to see the intended workflow.
The server side calls PMIx_Compute_distances() from prte_pmix_server_register_nspace() if the pmix_generate_distances is set.
Then from the client side you use PMIx_Get with PMIX_DEVICE_DISTANCES to grab the distances.

However, since the APIs are there and should be usable from the client as well, it seems like calling PMIx_Compute_distances() directly would avoid us having to rely on configuration, which may or may not be set. Thoughts?

@rhc54
Contributor

rhc54 commented Jan 11, 2023

This (openpmix/openpmix#2907) should fix the dilemma.

PMIx_Compute_distances() directly would avoid us having to rely on configuration, which may or may not be set. Thoughts?

There are tradeoffs, of course. You gain portability by doing it yourself as you don't know if PRRTE is around or not, and maybe the host environment isn't as kind as PRRTE. Only real negative is that you can't easily generate the distances for the remote side.

Here is what I would recommend. Try to perform a PMIx_Get for what you need. If it doesn't succeed, then go ahead and compute it yourself. The "get" call is pretty cheap once the new shmem subsystem is enabled, so it is probably worth trying.
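
A condensed sketch of that pattern, reusing the names from the earlier test code (topo, cpuset, info, ninfo) and assuming myproc identifies this process:

    pmix_value_t *val = NULL;
    pmix_device_distance_t *distances = NULL;
    size_t ndist = 0;
    pmix_info_t opt;

    /* "optional" => only check the local cache, don't ask the server */
    PMIX_INFO_LOAD(&opt, PMIX_OPTIONAL, NULL, PMIX_BOOL);
    if (PMIX_SUCCESS == PMIx_Get(&myproc, PMIX_DEVICE_DISTANCES, &opt, 1, &val)) {
        distances = (pmix_device_distance_t*)val->data.darray->array;
        ndist = val->data.darray->size;
    } else if (PMIX_SUCCESS != PMIx_Compute_distances(topo, &cpuset, info, ninfo,
                                                      &distances, &ndist)) {
        /* neither source worked - fall back to other selection logic */
    }
    PMIX_INFO_DESTRUCT(&opt);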

Meantime, if you are finding this useful, then we should turn on the generate distances feature by default, and ensure that we are computing the distances for the devices of interest. Might encourage others to use them as well 😄

@amirshehataornl
Contributor Author

@rhc54, I have another more general topology question.
Is PMIx able to gather the topology of processes on all nodes, or just on the local node?
Let's say my job has two nodes, A and B, and each node has 1 process running on it. Can the process running on A find the topology of the process running on B?

@rhc54
Contributor

rhc54 commented Jan 11, 2023

@rhc54, I have another more general topology question. Is PMIx able to gather the topology of processes on all nodes, or just on the local node? Let's say my job has two nodes, A and B, and each node has 1 process running on it. Can the process running on A find the topology of the process running on B?

By topology, do you mean the topology of the node? Or do you mean the binding of the process running on B?

The answer to the latter is "yes" - we provide you with the binding/cpuset of all procs in the job.

If you want the node's topology, that's more difficult to get. It is unlikely that the runtime will share full topologies around the job. However, PRRTE does do this, and so in that case it should be possible to retrieve it. I'd have to look to see the semantics for such a request.
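
For the binding case, a minimal sketch (assuming myproc identifies this process, rank 0 is the process on node B, and PMIX_CPUSET returns the binding as a string):

    pmix_proc_t peer;
    pmix_value_t *val = NULL;

    PMIX_LOAD_PROCID(&peer, myproc.nspace, 0);   /* rank 0 of this namespace */
    if (PMIX_SUCCESS == PMIx_Get(&peer, PMIX_CPUSET, NULL, 0, &val)) {
        fprintf(stderr, "rank 0 cpuset: %s\n", val->data.string);
        PMIX_VALUE_RELEASE(val);
    }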

@amirshehataornl
Contributor Author

@rhc54, found the cause of the crash when specifying the generate_dist variable:
openpmix/prrte#1637

@rhc54
Contributor

rhc54 commented Jan 11, 2023

Hooray! Thanks!

@rhc54
Contributor

rhc54 commented Jan 11, 2023

Here is what I would recommend. Try to perform a PMIx_Get for what you need. If it doesn't succeed, then go ahead and compute it yourself. The "get" call is pretty cheap once the new shmem subsystem is enabled, so it is probably worth trying.

BTW: when attempting this "get", be sure to pass an "optional" directive to it. This will instruct PMIx_Get to just check its local cache for the value and return if not found. Otherwise, it will poke the local server to try and retrieve it, and that gets more expensive. Looks like this:

pmix_info_t directive;

PMIX_INFO_LOAD(&directive, PMIX_OPTIONAL, NULL, PMIX_BOOL);
PMIx_Get(proc, key, &directive, 1, &value);
PMIX_INFO_DESTRUCT(&directive);

@amirshehataornl
Contributor Author

@rhc54, going back to the segfault: I'm still running into a problem when calling

PMIX_INFO_LIST_ADD(ret, pmap, PMIX_DEVICE_DISTANCES, &darray, PMIX_DATA_ARRAY)

in pmix_server_register_fns.c:prte_pmix_server_register_nspace()

darray is populated as follows

darray.array = distances;
darray.size = ndist;

It's missing darray.type, which causes a segfault later on.

I tried setting darray.type = PMIX_INFO, but I don't think that's right either. It looks like it ought to be set to PMIX_POINTER, but I'd like confirmation.

@rhc54
Contributor

rhc54 commented Jan 11, 2023

It should be PMIX_DEVICE_DIST

@rhc54
Contributor

rhc54 commented Jan 11, 2023

Note that the "info_list_add" code is going to make a copy of that array, and so there may be some issue in that code path. The array should consist of ndist number of pmix_device_distance_t structures.
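
So the population would look roughly like this (sketch; distances and ndist come from PMIx_Compute_distances() on the server side):

    pmix_data_array_t darray;

    darray.type = PMIX_DEVICE_DIST;   /* tells the copy code what each element is */
    darray.array = distances;
    darray.size = ndist;
    PMIX_INFO_LIST_ADD(ret, pmap, PMIX_DEVICE_DISTANCES, &darray, PMIX_DATA_ARRAY);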

@amirshehataornl
Contributor Author

@rhc54 fixed the other crash I ran into:
openpmix/prrte#1638

Also, I'm confirming that PMIx_Get(distance) works correctly. It succeeds if generate_distances is turned on; otherwise it fails.

I'd like to turn on distances by default. Are you okay with that? Basically the default value will be "fabric,gpu,network"

@rhc54
Contributor

rhc54 commented Jan 11, 2023

I'd like to turn on distances by default. Are you okay with that? Basically the default value will be "fabric,gpu,network"

Absolutely - thanks!

@amirshehataornl
Contributor Author

@rhc54, turned on distance generation by default: openpmix/prrte#1639
Next step is to update this patch to use distances instead of NUMA binding to determine the nearest interface.

@rhc54
Contributor

rhc54 commented Jan 13, 2023

Afraid I haven't seen their entry before - perhaps it would help if you provided the full entry for this device? Sounds like they didn't follow convention.

@bgoglin
Contributor

bgoglin commented Jan 13, 2023

@bgoglin, @rhc54, thanks for the pointers. I'm gdbing through these structures and the osdev doesn't have a "NodeGUID" infos entry, just the "Address" one. Is that expected? Should we be parsing out the NodeGUID and the SystemImageGUID from the address?

infos = 0x7fffc41cb4c0,
infos_count = 1,

(gdb) p *obj->io_first_child->infos
$12 = { 
  name = 0x7fffc41cb550 "Address",
  value = 0x7fffc41cb570 "02:00:00:00:12:e2"

Can you post the part of lstopo -v that contains this OS device, its parent PCI dev, and the other OS devices below it?

@amirshehataornl
Contributor Author

        HostBridge L#2 (buses=0000:[d4-d6])
          PCIBridge L#3 (busid=0000:d4:01.1 id=1022:14be class=0604(PCIBridge) link=31.51GB/s buses=0000:[d5-d5])
            PCI L#1 (busid=0000:d5:00.0 id=17db:0501 class=0200(Ethernet) link=31.51GB/s)
              Network L#0 (Address=02:00:00:00:12:a3) "hsn2"
          PCIBridge L#4 (busid=0000:d4:07.1 id=1022:14bf class=0604(PCIBridge) link=31.51GB/s buses=0000:[d6-d6])
            PCI L#2 (busid=0000:d6:00.0 id=1002:7408 class=0380(Display) link=31.51GB/s)

@amirshehataornl
Contributor Author

If NodeGUID were another infos entry, should it appear in the Network L#0 (Address=...) "hsn2" line?
I also ran lstopo -v on my local machine and I don't see anything except the Address entry in the Network line, but I'm not sure if that's just because that's the only thing that gets printed.

@rhc54
Contributor

rhc54 commented Jan 13, 2023

Hmmm...I wonder if the issue here is that they are identified as an "Ethernet" device instead of an "OpenFabrics" device? An Ethernet device wouldn't have GUIDs associated with it - just an IPv4/6 address. I'll bet that is the problem.

What are you seeing in the device_dist struct? I'm wondering if we just aren't constructing the struct info correctly, or perhaps misidentifying the device type.

@bgoglin
Contributor

bgoglin commented Jan 13, 2023

Oh, wait, this is a Cray device? Cray has never answered any of my requests for information or access to hardware, hence I don't know where to get Cray-specific information such as GUIDs, etc. So there's no hwloc support at all; you only see basic Network (Ethernet) information.

@hppritcha
Member

The Slingshot 11 NICs are basically Ethernet cards with some bells and whistles.

@rhc54
Contributor

rhc54 commented Jan 13, 2023

Perhaps someone can provide Brice with access to a machine so he can try to improve the HWLOC support? I know I tried to get it for him back in my Intel days, but left before we ever saw a device.

@bgoglin
Contributor

bgoglin commented Jan 13, 2023

Can you post the output of a libfabric tool that shows device attributes, addresses, etc for that hsn device? I'd like to get an idea of what hwloc could expose.

@hppritcha
Member

contacting Brice about this option offline

@amirshehataornl
Contributor Author

(gdb) p current_provider->nic
$1 = (struct fid_nic *) 0xcfed00
(gdb) p *current_provider->nic
$2 = {
  fid = {
    fclass = 20,
    context = 0x0,
    ops = 0x7fffe2724c80 <default_nic_ops>
  },
  device_attr = 0xcfed40,
  bus_attr = 0xcfee20,
  link_attr = 0xcfee40,
  prov_attr = 0x0
}
(gdb) p *current_provider->nic->device_attr
$3 = {
  name = 0xcfed80 "cxi0",
  device_id = 0xcfeda0 "0x501",
  device_version = 0xcfedc0 "2",
  vendor_id = 0xcfede0 "0x17db",
  driver = 0xcfee00 "cxi_core",
  firmware = 0x0
}
(gdb) p *current_provider->nic->bus_attr
$4 = {
  bus_type = FI_BUS_PCI,
  attr = {
    pci = {
      domain_id = 0,
      bus_id = 197 '\305',
      device_id = 0 '\000',
      function_id = 0 '\000'
    }
  }
}
(gdb) p *current_provider->nic->link_attr
$5 = {
  address = 0xcfee70 "0x12e2",
  mtu = 2112,
  speed = 200000000,
  state = FI_LINK_UP,
  network_type = 0xcfee90 "HPC Ethernet"
}

@amirshehataornl
Contributor Author

contacting Brice about this option offline

which option?

@hppritcha
Member

getting him an account on a box with ss11 nics

@bgoglin
Contributor

bgoglin commented Jan 13, 2023

Looks like there isn't much useful in the libfabric output. The address doesn't even look similar to what hwloc is seeing (maybe GDB doesn't know how to display it, is it a variable length array?)

@amirshehataornl
Contributor Author

The reason I'm going through this exercise at the moment is to be able to convert the current process/NIC binding code in OMPI to use distances, and to make the code generic (not specific to the Frontier platform).
Basically, with what's available at the moment, here are my thoughts (a rough sketch of the selection step follows the list):

  1. Get the device distances from current process
  2. Identify the device(s) nearest the process. Could be more than 1 if the distances are the same
  3. Find the matching fi_info blocks:
    3a. Go through the fi_info block list returned by fi_getinfo() and use the pci information to look up the pci object
    3b. Iterate through the osdevs related to the pci object (there could be multiple of them) and compare the address with the address in the distance structure. The assumption here is that if there are multiple os objects, each one will have a unique address.
    3c. matching fi_info is added to a list
  4. if there is only one fi_info in the list, then select that one
  5. otherwise use the process rank to select one.
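
A rough sketch of steps 1, 2 and 5 (names are placeholders; dists/ndist are the PMIx device distances and local_rank is the node-local rank):

    pmix_device_distance_t *cand[16];    /* hypothetical cap on candidates */
    pmix_device_distance_t *chosen = NULL;
    uint16_t best = UINT16_MAX;
    size_t n, ncand = 0;

    for (n = 0; n < ndist; n++) {
        if (dists[n].mindist < best) {        /* strictly nearer: restart the list */
            best = dists[n].mindist;
            ncand = 0;
        }
        if (dists[n].mindist == best && ncand < 16) {
            cand[ncand++] = &dists[n];        /* equally near: keep as a candidate */
        }
    }
    if (0 < ncand) {
        chosen = cand[local_rank % ncand];    /* step 5: spread ranks across ties */
    }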

Sound reasonable?

@amirshehataornl
Contributor Author

Looks like there isn't much useful in the libfabric output. The address doesn't even look similar to what hwloc is seeing (maybe GDB doesn't know how to display it, is it a variable length array?)

struct fi_link_attr {
	char			*address;
	size_t			mtu;
	size_t			speed;
	enum fi_link_state	state;
	char			*network_type;
};

struct fi_bus_attr {
	enum fi_bus_type	bus_type;
	union {
		struct fi_pci_attr	pci;
	} attr;
};

struct fi_device_attr {
	char			*name;
	char			*device_id;
	char			*device_version;
	char			*vendor_id;
	char			*driver;
	char			*firmware;
};

struct fid_nic {
	struct fid		fid;
	struct fi_device_attr	*device_attr;
	struct fi_bus_attr	*bus_attr;
	struct fi_link_attr	*link_attr;
	void			*prov_attr;
};

Address is written as follows:

	ret = asprintf(&fi->nic->link_attr->address, "0x%x",
		       nic_if->info->nic_addr);

So I don't think that gdb is misinterpreting it.

@rhc54
Contributor

rhc54 commented Jan 13, 2023

I suspect the problem is that libfabric is obtaining the info from HWLOC, which has no real access (at this point) to the Cray information because they haven't shared it. I doubt we can resolve how to use device_dist until HWLOC can support this hardware, so @hppritcha is likely taking the right path.

@rhc54
Contributor

rhc54 commented Jan 13, 2023

@amirshehataornl One thing you could investigate is if PMIx is correctly identifying this device as a "network" vs "openfabric" device. If so, then it should be providing the IP address as the UUID. Of course, that won't help if the address being reported by HWLOC differs from that you are getting from the Cray OFI provider.

Maybe you should look at that provider and see where it is getting the address? Could help explain the difference.

@amirshehataornl
Contributor Author

@rhc54, @bgoglin, I updated the patch; can you take a look and let me know what you think of this approach?

@rhc54 rhc54 left a comment
Contributor

Overall, I think that should work - but only for the Cray and other Ethernet providers. I realize that covers the range of your interest, but is this going to cause a problem for OpenFabric providers?

distances = (pmix_device_distance_t*)dptr->array;

for (i = 0; i < dptr->size; i++)
fprintf(stderr, "%d: %d:%s:%d:%d\n", getpid(), i, distances[i].uuid,
Contributor

Shame on me - I missed the pmix_device_distances_t struct when providing "pretty-print" functions. I'll address that soon.

Contributor Author

Can you expand why this wouldn't work for other HW?

+ strlen(distances[i].uuid))
continue;
if (!strcmp(addr+3, address)) {
fprintf(stderr, "%d matched distance addr %s with %s\n",
Contributor

If the Cray OFI provider's IP address matches that reported by HWLOC, then this should work (or so I expect).

Contributor Author

Sorry, I'm going to remove the fprintfs - these slipped in from my debugging.

@rhc54
Copy link
Contributor

rhc54 commented Jan 14, 2023

Can you expand why this wouldn't work for other HW?

The basic approach would work, just not the mechanics you have here. The problem lies in the precise attributes provided for each class of device. Ethernet devices have IP addresses associated with them - since the Cray SS11 is identified in that class, you can compare the PMIx UUID with the "address" attribute. These devices are classed as network devices.

However, Infiniband devices are classed as OpenFabric devices. If they have been configured to support IP-over-IB, then they will have a port that is shown as a network device and has an IP address - but that isn't so common. They will also have ports identified as OpenFabric devices - this is the IB network connection. These ports will have NodeGUID and SystemGUID attributes (not address attributes), and so you would have to compare the PMIx UUID with those attributes.

You'll find the same to be true of GPUs and other device types - see the PMIx device distance code for more details:

                if (HWLOC_OBJ_OSDEV_NETWORK == table[n].hwtype) {
                    char *addr = NULL;
                    /* find the address */
                    for (i = 0; i < device->infos_count; i++) {
                        if (0 == strcasecmp(device->infos[i].name, "Address")) {
                            addr = device->infos[i].value;
                            break;
                        }
                    }
                    if (NULL == addr) {
                        /* couldn't find an address - report it as an error */
                        PMIX_LIST_DESTRUCT(&dists);
                        return PMIX_ERROR;
                    }
                    /* could be IPv4 or IPv6 */
                    cnt = countcolons(addr);
                    if (5 == cnt) {
                        pmix_asprintf(&d->dist.uuid, "ipv4://%s", addr);
                    } else if (19 == cnt) {
                        pmix_asprintf(&d->dist.uuid, "ipv6://%s", addr);
                    } else {
                        /* unknown address type */
                        PMIX_LIST_DESTRUCT(&dists);
                        return PMIX_ERROR;
                    }
                } else if (HWLOC_OBJ_OSDEV_OPENFABRICS == table[n].hwtype) {
                    char *ngid = NULL;
                    char *sgid = NULL;
                    /* find the UIDs */
                    for (i = 0; i < device->infos_count; i++) {
                        if (0 == strcasecmp(device->infos[i].name, "NodeGUID")) {
                            ngid = device->infos[i].value;
                        } else if (0 == strcasecmp(device->infos[i].name, "SysImageGUID")) {
                            sgid = device->infos[i].value;
                        }
                    }
                    if (NULL == ngid || NULL == sgid) {
                        PMIX_LIST_DESTRUCT(&dists);
                        return PMIX_ERROR;
                    }
                    pmix_asprintf(&d->dist.uuid, "fab://%s::%s", ngid, sgid);
                } else if (HWLOC_OBJ_OSDEV_GPU == table[n].hwtype) {
                    /* if the name starts with "card", then this is just the aux card of the GPU */
                    if (0 == strncasecmp(device->name, "card", 4)) {
                        pmix_list_remove_item(&dists, &d->super);
                        PMIX_RELEASE(d);
                        device = hwloc_get_next_osdev(topo->topology, device);
                        continue;
                    }
                    pmix_asprintf(&d->dist.uuid, "gpu://%s::%s", pmix_globals.hostname,
                                  device->name);
                } else {
                    /* unknown type */
                    pmix_list_remove_item(&dists, &d->super);
                    PMIX_RELEASE(d);
                    device = hwloc_get_next_osdev(topo->topology, device);
                    continue;
                }

@amirshehataornl
Contributor Author

amirshehataornl commented Jan 14, 2023

Makes sense. Keeping in mind that this function specifically targets network interfaces (i.e., find the nearest networking/openfabrics device to the process), it should be enough to check the type of the device and apply the logic from the HWLOC_OBJ_OSDEV_OPENFABRICS and HWLOC_OBJ_OSDEV_NETWORK cases:
HWLOC_OBJ_OSDEV_OPENFABRICS: check NodeGUID and SysImageGUID
HWLOC_OBJ_OSDEV_NETWORK: check Address

We don't need to handle GPUs in this function.
Do you agree this would be sufficient?

My goal is to land this patch in OMPI, so we don't have to track it separately.
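
A rough sketch of those two cases (assuming the usual hwloc/string headers; names are illustrative, and the UUID formats follow the PMIx code quoted above):

    static bool osdev_matches_uuid(hwloc_obj_t dev, const char *uuid)
    {
        const char *addr, *ngid, *sgid;

        if (HWLOC_OBJ_OSDEV_NETWORK == dev->attr->osdev.type) {
            /* PMIx reports these as "ipv4://<Address>" or "ipv6://<Address>" */
            addr = hwloc_obj_get_info_by_name(dev, "Address");
            return (NULL != addr && NULL != strstr(uuid, addr));
        }
        if (HWLOC_OBJ_OSDEV_OPENFABRICS == dev->attr->osdev.type) {
            /* PMIx reports these as "fab://<NodeGUID>::<SysImageGUID>" */
            ngid = hwloc_obj_get_info_by_name(dev, "NodeGUID");
            sgid = hwloc_obj_get_info_by_name(dev, "SysImageGUID");
            return (NULL != ngid && NULL != sgid &&
                    NULL != strstr(uuid, ngid) && NULL != strstr(uuid, sgid));
        }
        return false;   /* GPUs etc. are deliberately not handled here */
    }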

@rhc54
Contributor

rhc54 commented Jan 14, 2023

Yes - if you handle the two cases, you should be fine.

The existing code in compare_cpusets() assumed that some non-IO ancestor of a
PCI object should intersect with the cpuset of the proc. However, this is
not true. There is a case where the non-IO ancestor can be an L3 cache. If
there exist two L3s on the same NUMA node and the process is bound to one L3,
but the PCI object is connected to the other L3, then compare_cpusets() will
return false.

A better way to determine the optimal interface is by finding the
distances of the interfaces from the current process. Then find out which
of these interfaces is nearest the process and select it.

Use the PMIx distance generation for this purpose.

Signed-off-by: Amir Shehata <[email protected]>
@amirshehataornl
Contributor Author

@rhc54, @bgoglin, would you be able to take a look at this patch and let me know if you're good with it, or if you'd like to see further changes?

@bgoglin
Contributor

bgoglin commented Jan 20, 2023

I don't see anything wrong, but I don't know exactly what PMIx and libfabric report here, hence it's impossible to be sure what's going to happen during your tests.

@rhc54
Contributor

rhc54 commented Jan 21, 2023

I'm good with it - I believe it will do what you seek. Will obviously need exposure to various environments to be sure.

@rhc54
Contributor

rhc54 commented Jan 21, 2023

FWIW: the error is due to Java failures to pull the repos. I believe there has been a change to the CI since this PR was originally created, so you may need to do a rebase against main to get the right CI triggers to fire.

@amirshehataornl amirshehataornl closed this by deleting the head repository Jan 24, 2023