opal/ofi: refactor NIC selection logic #11689

Merged
merged 1 commit into from
Oct 31, 2023

Conversation

wenduwan
Contributor

@wenduwan wenduwan commented May 17, 2023

This patch refactors the OFI NIC selection logic. It foremost improves
the NIC search algorithm. Instead of searching for the closest NICs on
the system, this patch directly compares the distances of the given
providers and selects the nearest NIC.

This change also makes it explicit that if the process is unbound, or
the distance cannot be reliably calculated, a provider will be selected
in round-robin fashion.
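The selection policy described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `struct candidate`, `DIST_UNKNOWN`, and `select_candidate` are hypothetical names, and treating a single unknown distance as "not reliable" for the whole list is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker for "distance could not be computed". */
#define DIST_UNKNOWN UINT32_MAX

/* Hypothetical stand-in for one provider entry; not an Open MPI type. */
struct candidate {
    const char *name;  /* illustrative device name */
    uint32_t distance; /* DIST_UNKNOWN when it cannot be computed */
};

/*
 * Return the index of the selected candidate, or -1 for an empty list.
 * Unbound process or any unknown distance: round-robin over the whole list.
 * Otherwise: round-robin among the candidates sharing the minimum distance.
 */
static int select_candidate(const struct candidate *list, size_t count,
                            int process_is_bound, uint32_t rank)
{
    if (0 == count) {
        return -1;
    }

    int reliable = process_is_bound;
    uint32_t min_dist = DIST_UNKNOWN;
    for (size_t i = 0; reliable && i < count; i++) {
        if (DIST_UNKNOWN == list[i].distance) {
            reliable = 0;
        } else if (list[i].distance < min_dist) {
            min_dist = list[i].distance;
        }
    }
    if (!reliable) {
        return (int) (rank % count);
    }

    /* Only the given candidates are compared; nothing else is enumerated. */
    size_t num_nearest = 0;
    for (size_t i = 0; i < count; i++) {
        if (list[i].distance == min_dist) {
            num_nearest++;
        }
    }

    size_t want = rank % num_nearest;
    size_t seen = 0;
    for (size_t i = 0; i < count; i++) {
        if (list[i].distance == min_dist && seen++ == want) {
            return (int) i;
        }
    }
    return -1; /* not reached: num_nearest >= 1 here */
}
```

With three candidates at distances 2, 1, 1, a bound process of rank 0 would get index 1 and rank 1 would get index 2; an unbound process simply gets `rank % 3`.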

@wenduwan
Contributor Author

@amirshehataornl Could you please review this change?

@wenduwan wenduwan added the bug label May 19, 2023
@wenduwan wenduwan requested a review from hppritcha May 19, 2023 17:52
@wenduwan
Contributor Author

Requested second opinions from @hppritcha

Contributor

@lrbison lrbison left a comment

Second pass, some new thoughts, sorry for the double-review.

@wenduwan
Contributor Author

After thinking about OPAL_OFI_PCI_DATA_AVAILABLE a little bit, it seems is_near only needs that macro when calling match_device_by_pci_bus_id. Do we need that macro if we want to use match_device_by_name_prefix for EFA?

No preference here. It works in either case since the function is only used in is_near which is protected by OPAL_OFI_PCI_DATA_AVAILABLE.
Moved the function outside the macro in update.

@wenduwan
Contributor Author

I have observed an unrelated segfault in opal with old hwloc 1.11.8. Will look at that separately.

#0  0x00007f91c2a961be in fill_cache_line_size () at base/hwloc_base_util.c:87
#1  0x00007f91c2a96a33 in opal_hwloc_base_get_topology () at base/hwloc_base_util.c:325
#2  0x00007f91c2a5f528 in opal_common_ofi_select_provider (provider_list=0x8f0190, process_info=0x7f91c2cfdbc0 <opal_process_info>) at common_ofi.c:825
#3  0x00007f91c3b9951a in select_ofi_provider (providers=0x8f0190, include_list=0x0, exclude_list=0x8d02c0) at mtl_ofi_component.c:356
#4  0x00007f91c3b9a8e9 in ompi_mtl_ofi_component_init (enable_progress_threads=false, enable_mpi_threads=false, accelerator_support=0x7f91c4056730 <mca_mtl_ofi_component+272>) at mtl_ofi_component.c:779
#5  0x00007f91c3b8fd6f in ompi_mtl_base_select (enable_progress_threads=false, enable_mpi_threads=false, priority=0x7fffd1da6c64) at base/mtl_base_frame.c:78
#6  0x00007f91c3cf1fd5 in mca_pml_cm_component_init (priority=0x7fffd1da6c64, enable_progress_threads=false, enable_mpi_threads=false) at pml_cm_component.c:146
#7  0x00007f91c3cc52b0 in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false) at base/pml_base_select.c:127
#8  0x00007f91c39af79c in ompi_mpi_instance_init_common (argc=5, argv=0x7fffd1da7a98) at instance/instance.c:508
#9  0x00007f91c39b014d in ompi_mpi_instance_init (ts_level=0, info=0x62d9e0 <ompi_mpi_info_null>, errhandler=0x7f91c40677e0 <ompi_mpi_errors_are_fatal>, instance=0x7f91c40788c0 <ompi_mpi_instance_default>, argc=5, argv=0x7fffd1da7a98) at instance/instance.c:814
#10 0x00007f91c399fa4f in ompi_mpi_init (argc=5, argv=0x7fffd1da7a98, requested=0, provided=0x7fffd1da78ac, reinit_ok=false) at runtime/ompi_mpi_init.c:359
#11 0x00007f91c3a04d05 in PMPI_Init (argc=0x7fffd1da792c, argv=0x7fffd1da7920) at init.c:67
#12 0x00000000004026e3 in main (argc=<optimized out>, argv=<optimized out>) at osu_bw.c:61

@lrbison
Contributor

lrbison commented May 22, 2023

the function is only used in is_near which is protected by OPAL_OFI_PCI_DATA_AVAILABLE

I guess my point is, don't we want is_near to work on "efa" provider even without OPAL_OFI_PCI_DATA_AVAILABLE?

Or maybe it's a question: could it? I didn't see any reason we need pci data for that case.

@wenduwan
Contributor Author

I guess my point is, don't we want is_near to work on "efa" provider even without OPAL_OFI_PCI_DATA_AVAILABLE?
Or maybe it's a question: could it? I didn't see any reason we need pci data for that case.

Maybe not so much for this change. But it will become more obvious with my upcoming change wrt accelerator awareness, which actually uses struct fi_info.nic.

We could experiment with removing OPAL_OFI_PCI_DATA_AVAILABLE. Thinking about it, we need to test it with some old libfabric to be certain.

return true;

if (fi_info && fi_info->fabric_attr && fi_info->fabric_attr->prov_name
&& 0 == strcasecmp("efa", fi_info->fabric_attr->prov_name)) {
Member

Ugh; I don't understand this comment or why we're using ibv information here; let's step back and assume we've gone wrong somewhere and try again.

Contributor Author

@bwbarrett Oh do you mean the NIC selection logic is ambiguous?

The current logic is:

  1. Find the list of nearest NICs relative to the current process' package
  2. Enumerate over the candidate NICs, and determine which ones are in the above list
  3. Select 1 NIC from the list based on (rank on package) % (#nearest NICs)

What went wrong is step 2, determining which NICs are in the above list: the code was attempting to match NodeGUID and SysImageGUID, both of which are 0 for EFA.

Or did I not understand your question?

Member

I guess I don't understand why we're using NodeGUID and SysImageGUID for that at all, or why they're 0. But I object to there being EFA-specific code here, as there's nothing about EFA that should make it need special code here.

Contributor Author

I see. It was introduced in this change d4e1ae5

I guess Amir had a reason to check the guid values for his platform. Since I can only speak for EFA, could @amirshehataornl chime in on the rationale for the change?

In general I agree with Brian's assessment. If we don't need guids I can attempt a more generic/simpler logic, e.g. comparing PCI BDF.

Contributor Author

@bwbarrett I have refactored the logic that's incompatible with EFA, such that we do not need to call out EFA(or other vendor) in particular.

@amirshehataornl I reckon this conflicted with your previous refactor. Could you take a look and test it out on your system? FWICT it also addresses your previous concern with cpuset intersection with the wrong L3 cache.

@wenduwan
Contributor Author

I have observed an unrelated segfault in opal with old hwloc 1.11.8. Will look at that separately.

Confirmed that this is a non-issue. Gotta run make clean when switching from internal hwloc to external.

@wenduwan wenduwan changed the title opal/mca/common: match efa nic by device name opal/ofi: refactor NIC selection logic May 25, 2023
Contributor

@amirshehataornl amirshehataornl left a comment

Why are we moving away from using PMIx distances and re-implementing the code using hwloc directly?
Can you please give a justification for not using PMIx distances? Does this implementation give us something not provided via PMIx?

@rhc54
Contributor

rhc54 commented May 27, 2023

I thought this was intended to deal only with the case where PMIx wasn't providing the device distance. However, looking at what is now in the PR, I have to agree with @amirshehataornl questions - why was the PMIx code removed???

@wenduwan
Contributor Author

@amirshehataornl @rhc54 After syncing with @bwbarrett I realized that there are 2 issues with the prior implementation.

The primary concern was around the use of:

typedef struct pmix_device_distance {
    char *uuid;
    char *osname;
    pmix_device_type_t type;
    uint16_t mindist;
    uint16_t maxdist;
} pmix_device_distance_t;

Uniqueness of uuid and osname is vendor-specific, which led me to introduce additional logic to handle the EFA case in the 1st iteration. Come to think of it, PCI BDF is currently the most reliable way to identify the NIC. Using hwloc and PCI BDF directly should avoid this issue for other vendors as well, and prevent breakage in the future.
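To illustrate what matching by PCI BDF means, here is a minimal sketch. `struct pci_bdf` and `pci_bdf_equal` are hypothetical names invented for this example; the real code would read the address from libfabric's `struct fi_pci_attr` and from the hwloc PCI device object rather than from this stand-in type.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PCI address: domain:bus:device.function. */
struct pci_bdf {
    uint16_t domain;
    uint8_t bus;
    uint8_t device;
    uint8_t function;
};

/* Two devices are the same NIC only if every address component matches. */
static bool pci_bdf_equal(const struct pci_bdf *a, const struct pci_bdf *b)
{
    return a->domain == b->domain && a->bus == b->bus
           && a->device == b->device && a->function == b->function;
}
```

Unlike a GUID that a vendor may leave as 0 on every device, two distinct PCI functions on one host cannot share the same address, so this comparison needs no vendor-specific branch.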

The secondary issue is code conciseness and efficiency. The updated impl reduces hwloc version checks, and directly addresses Amir's concern, i.e.

The existing code in compare_cpusets assumed that some non_io ancestor of a
PCI object should intersect with the cpuset of the proc. However, this is
not true. There is a case where the non IO ancestor can be an L3. If there
exists two L3s on the same NUMA and the process is bound to one L3, but
the PCI object is connected to the other L3, then compare_cpusets() will
return false.

... without iterating over the entire graph to find out all nearby devices, i.e. if provider_list only has 2 devices then we do not need to calculate distances from others.
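The L3 corner case quoted above can be reproduced with a toy model. This is only a sketch under stated assumptions: cpusets are plain 64-bit masks, `struct node` is a hypothetical stand-in for a topology object (not an hwloc type), and `levels_to_common_ancestor` is an invented helper, not the distance metric used in this PR.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy cpuset: bit i set means logical CPU i is included. */
typedef uint64_t cpuset_t;

static bool cpuset_intersects(cpuset_t a, cpuset_t b)
{
    return (a & b) != 0;
}

/* Hypothetical topology node with a link to its parent. */
struct node {
    cpuset_t cpuset;
    const struct node *parent;
};

/*
 * Starting at the device's first non-IO ancestor, climb until an ancestor's
 * cpuset intersects the process binding. Returns the number of levels
 * climbed, or -1 if no ancestor intersects.
 */
static int levels_to_common_ancestor(const struct node *first_non_io,
                                     cpuset_t proc_cpuset)
{
    int depth = 0;
    for (const struct node *n = first_non_io; NULL != n; n = n->parent, depth++) {
        if (cpuset_intersects(n->cpuset, proc_cpuset)) {
            return depth;
        }
    }
    return -1;
}
```

With one NUMA node covering CPUs 0-7 and two L3 caches covering CPUs 0-3 and 4-7, a process bound to the first L3 does not intersect the second L3, so a single intersection test against the NIC's first non-IO ancestor fails even though both sit on the same NUMA node. Climbing one more level finds the shared ancestor.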

That said I might have missed some benefits from PMIx, if so please let me know.

@amirshehataornl
Contributor

The purpose of the PMIx layer, as far as I can tell, is to provide infrastructure to find the distances in a generic way. If there is a specific type of HW that's not supported, it would make more sense to resolve that at the PMIx layer. @rhc54, what are your thoughts on this?

@hppritcha hppritcha self-requested a review August 21, 2023 16:08
Member

@hppritcha hppritcha left a comment

There's been a great deal of discussion here and valid points raised, but as far as being functional goes for the libfabric provider I'm interested in, it works okay, so approving.

@wenduwan
Contributor Author

@amirshehataornl Do you have additional comments on the current revision?

@rhc54
Contributor

rhc54 commented Aug 21, 2023

Yes, basically here is the patch set I'm thinking of:

1. Fix a corner case in the original code where we wouldn't pick a near interface if one exists (this is this PR)

2. Add a NIC selection policy environment variable

3. Use the NIC selection policy environment variable to select the nearest NIC

4. Once PMIx adds a way to find whether a NIC and a GPU share a bridge, we can modify the code to use that for the GPU_DIRECT case.

We can do this in the same PR as a set of patches (which I think is cleaner, since all the patches will remain together instead of having them dispersed across history) or I can open a new PR for this.

thoughts?

Been spending time trying to understand the NIC-GPU association problem and finally am beginning to untangle it. Part of the confusion stems from the large changes in package composition over the last 10 years, which is accelerating as we speak. For example, the topology shown above by @wenduwan doesn't exist any more - stopped being shipped more than a decade ago. Instead, we have far more complex arrangements with multiple distinct PCI busses being attached to various points in each package.

I need to do a code audit and some testing, but I believe the PMIx distance calculation remains correct. However, it provides the distance between a given process and the devices on the node - it says nothing at all about device-to-device relationships. For the case of GDR, you want to select a NIC (or GPU) that has a specific relationship to a GPU (or NIC) - that cannot be done on the basis of distance.

In other words, for the GDR case, you want to get the distance to NICs that meet a specific condition (conjoined to one or more specific GPUs). You don't want to consider NICs that fail to meet that condition. Once the distances are returned, you would then select the one with the least distance. If no NIC meets the condition, then you get nothing back. You could then remove the condition and just take the NIC that is closest to you, noting that GDR is unavailable. Ditto for the reverse problem of getting distance to GPUs that are conjoined to specific NICs.

This shouldn't be too hard to provide within the existing APIs. Will try to work on it over the next week or two.

@wenduwan
Contributor Author

I agree with @rhc54 's idea that we should separate 2 concerns in GDR case:

  • Filtering qualified devices
  • Select the optimal device from the qualified set

This is a good practice in general and I think we should follow this to implement GPU aware device selection.
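The two concerns can be kept apart along these lines. This is a hedged sketch, not an existing Open MPI or PMIx API: `struct nic`, the `shares_bridge_with_gpu` flag, `gdr_capable`, `nearest_matching`, and `select_nic` are all hypothetical, and how the "conjoined to a GPU" condition is actually determined is left open in this discussion.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical NIC record for illustration. */
struct nic {
    uint32_t distance;           /* distance from the calling process */
    bool shares_bridge_with_gpu; /* assumed to be known in advance */
};

typedef bool (*nic_filter_fn)(const struct nic *);

/* Concern 1: decide whether a device qualifies at all. */
static bool gdr_capable(const struct nic *n)
{
    return n->shares_bridge_with_gpu;
}

/* Concern 2: pick the nearest device among those that pass the filter.
 * A NULL filter accepts every device. Returns an index or -1. */
static int nearest_matching(const struct nic *nics, size_t count,
                            nic_filter_fn filter)
{
    int best = -1;
    for (size_t i = 0; i < count; i++) {
        if (NULL != filter && !filter(&nics[i])) {
            continue;
        }
        if (best < 0 || nics[i].distance < nics[best].distance) {
            best = (int) i;
        }
    }
    return best;
}

/* Try the GDR condition first; if nothing qualifies, drop the condition,
 * take the closest NIC, and report that GDR is unavailable. */
static int select_nic(const struct nic *nics, size_t count, bool *gdr_available)
{
    int idx = nearest_matching(nics, count, gdr_capable);
    *gdr_available = (idx >= 0);
    if (idx < 0) {
        idx = nearest_matching(nics, count, NULL);
    }
    return idx;
}
```

Keeping the filter as a separate predicate means a different condition (for example, a future selection policy) can reuse the same nearest-device step unchanged.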

@amirshehataornl
Contributor

I'm already working on a PR that implements the approach I mentioned above. Basically, the logic @rhc54 mentioned will be integrated under the GDR case. I'll push the PR once I have it tested and working. We have some NVIDIA machines local here, so I should be able to verify the GDR case, once it's available.

@wenduwan
Contributor Author

I will test the patch on the new p5 platform to be bullet proof - it has extra NICs.
https://aws.amazon.com/ec2/instance-types/p5/

Contributor

@lrbison lrbison left a comment

Some small nits. I can approve the logic and implementation of the selection, but I don't have enough background to ensure there aren't hwloc subtleties we are missing.

static int get_provider_nic_pci(struct fi_info *provider, struct fi_pci_attr *pci)
{
if (NULL != provider->nic && NULL != provider->nic->bus_attr
&& provider->nic->bus_attr->bus_type == FI_BUS_PCI) {
Contributor

ordering nit: FI_BUS_PCI == provider->.... to be consistent with NULL checks.

Contributor Author

Thanks will address

Comment on lines 543 to 544
opal_output_verbose(1, opal_common_ofi.output, "Provider does not have PCI device: %s",
provider->domain_attr->name);
Contributor

I think this log message is confusing. Reading just the error it's not clear if we care it has a PCI device, and it looks like we are searching for a PCI device named "efa" or something.

I suggest it should be more like: "Cannot determine device distance: Provider %s does not have PCI device."

Contributor Author

Cool will rephrase.

* @return An array of device distances which are nearest this thread
* or NULL if we fail to get the distances. In this case we will just
* revert to round robin.
* @param[in] topoloy hwloc topolocy
Contributor

typos: topoloy and topolocy

Contributor Author

Thanks will fix.

@rhc54
Contributor

rhc54 commented Aug 22, 2023

Could you perhaps send along the topology for that box? I'm always looking for new test topologies.

@wenduwan
Contributor Author

Could you perhaps send along the topology for that box? I'm always looking for new test topologies.

Sure.

p5

@rhc54
Contributor

rhc54 commented Aug 22, 2023

Could I get the actual xml file so I can use it for testing?

@wenduwan
Contributor Author

Could I get the actual xml file so I can use it for testing?

Ah got it. Does this one work for you?
p5.xml.zip

@rhc54
Contributor

rhc54 commented Aug 22, 2023

Ah got it. Does this one work for you?

Yep - thx!

This patch refactors the OFI NIC selection logic. It foremost improves
the NIC search algorithm. Instead of searching for the closest NICs on
the system, this patch directly compares the distances of the given
providers and selects the nearest NIC.

This change also makes it explicit that if the process is unbound, or
the distance cannot be reliably calculated, a provider will be selected
in round-robin fashion.

Signed-off-by: Wenduo Wang <[email protected]>

provider_rank = rank % num_nearest;
num_nearest = 0;
Contributor Author

I noticed a little nuance between num_nearest++ vs --num_nearest.

--num_nearest will result in NIC-3 being selected before NIC-2, i.e. NIC-3, NIC-2, NIC-1, NIC-0.

This is counterintuitive, so I inverted the order so that NIC-0 is selected before NIC-1, NIC-2, NIC-3. Just a little nicer to the human brain.
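The difference between the two counters can be seen in a small sketch. The full loop from the PR is not shown in this thread, so the functions below are a hypothetical reconstruction: `near[i]` marks candidate i as one of the nearest NICs, and `provider_rank` is assumed to already be `rank % num_nearest`.

```c
#include <stddef.h>

/* Counting up: the k-th near NIC in list order goes to provider_rank k. */
static int pick_ascending(const int *near, size_t count, size_t provider_rank)
{
    size_t num_nearest = 0;
    for (size_t i = 0; i < count; i++) {
        if (near[i] && provider_rank == num_nearest++) {
            return (int) i;
        }
    }
    return -1;
}

/* Counting down from the total: provider_rank 0 matches the last near NIC.
 * num_nearest must equal the number of nonzero entries in near[]. */
static int pick_descending(const int *near, size_t count, size_t num_nearest,
                           size_t provider_rank)
{
    for (size_t i = 0; i < count; i++) {
        if (near[i] && provider_rank == --num_nearest) {
            return (int) i;
        }
    }
    return -1;
}
```

With four near NICs, `pick_ascending` maps provider ranks 0, 1, 2, 3 to NIC-0, NIC-1, NIC-2, NIC-3, while `pick_descending` maps them to NIC-3, NIC-2, NIC-1, NIC-0. Both spread ranks evenly; only the order differs.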

@wenduwan
Contributor Author

@lrbison I tested the change on p5 and made an adjustment to nic selection order. Please take a second look when you get a chance.

@lrbison
Contributor

lrbison commented Sep 1, 2023

Minor changes from before, still looks good to me.

@wenduwan
Contributor Author

wenduwan commented Sep 5, 2023

@amirshehataornl Could you give it another look? I addressed additional comments after your approval.

@wenduwan
Contributor Author

@lrbison @hppritcha Haven't heard from Amir for a while. Good to merge?

@hppritcha
Member

Fine with me, but I'd prefer to get @bwbarrett's opinion at this point.

@wenduwan
Contributor Author

@bwbarrett Could you PTAL?

@wenduwan
Contributor Author

@hppritcha @lrbison Shall we merge the change?

@hppritcha hppritcha merged commit 5cb3094 into open-mpi:main Oct 31, 2023
8 checks passed
7 participants