-
Notifications
You must be signed in to change notification settings - Fork 865
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
opal/ofi: refactor NIC selection logic #11689
Conversation
@amirshehataornl Could you please review this change? |
Requested second opinions from @hppritcha |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second pass, some new thoughts, sorry for the double-review.
71a9977
to
49fcd44
Compare
No preference here. It works in either case since the function is only used in |
I have observed an unrelated segfault in opal with old hwloc 1.11.8. Will look at that separately.
|
I guess my point is, don't we want Or maybe it's a question: could it? I didn't see any reason we need pci data for that case. |
Maybe not so much for this chagne. But it will become more obvious with my upcoming change wrt accelerator awareness, which actually uses We could experiment with removing |
opal/mca/common/ofi/common_ofi.c
Outdated
return true; | ||
|
||
if (fi_info && fi_info->fabric_attr && fi_info->fabric_attr->prov_name | ||
&& 0 == strcasecmp("efa", fi_info->fabric_attr->prov_name)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh; I don't understand this comment or why we're using ibv information here; let's step back and assume we've gone wrong somewhere and try again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bwbarrett Oh do you mean the NIC selection logic is ambiguous?
The current logic is:
- Find the list of nearest NICs relative to the current process' package
- Enumerate over the candidate NICs, and determine which ones are in the above list
- Select 1 NIC from the list based on (rank on package) % (#nearest NICs)
What went wrong here is in 2) determine which ones are in the above list
- the code was attempting to match NodeGUID
&SysImageGUID
, both of which are 0's for EFA.
Or did I not understand your question?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i guess I don't understand why we're using NodeGUID and SysImageGUID for that at all, or why they're 0. BUt I object to there being EFA-specific code here, as there's nothing about EFA that should make it need special code here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. It was introduced in this change d4e1ae5
I guess Amir had a reason to check the guid values for his platform. Since I can only speak for EFA, could @amirshehataornl chime in to the rationale of the change?
In general I agree with Brian's assessment. If we don't need guids I can attempt a more generic/simpler logic, e.g. comparing PCI BDF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bwbarrett I have refactored the logic that's incompatible with EFA, such that we do not need to call out EFA(or other vendor) in particular.
@amirshehataornl I reckon this conflicted with your previous refactor. Could you take a look and test it out on your system? FWICT it also addresses your previous concern with cpuset intersection with the wrong L3 cache.
Confirmed that this is no-issue. Gotta make clean between switching from internal hwloc to external. |
49fcd44
to
607b6c5
Compare
607b6c5
to
fa8d042
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we moving away from using PMIx distances and re-implementing the code using hwloc directly.
Can you please give a justification for not using PMIx distances? Does this implementation give us something not provided via PMIx?
I thought this was intended to deal only with the case where PMIx wasn't providing the device distance. However, looking at what is now in the PR, I have to agree with @amirshehataornl questions - why was the PMIx code removed??? |
@amirshehataornl @rhc54 After syncing with @bwbarrett I realized that there are 2 issues with the prior implementation. The primary concern was around the use of:
Uniqueness of uuid and osname is vendor-specific, which led me to introduce additional logic to handle EFA case in the 1st iteration. Come to think about it, PCI BDF is currently the most reliable way to identify the NIC. Using hwloc and PCI BDF directly should avoid this issue for other vendors as well, and prevent breakage in the future. The secondary issue is code conciseness and efficiency. The updated impl reduces hwloc version checks, and directly addresses Amir's concern, i.e.
... without iterating over the entire graph to find out all nearby devices, i.e. if That said I might have missed some benefits from PMIx, if so please let me know. |
PMIx layer purpose, as far as I can tell, is to provide infrastructure to find the distances in a generic way. If there is a specific type of HW that's not supported, it would make more sense to resolve that at the PMIx layer. @rhc54, what are your thoughts on this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's been a great deal of discussion here and valid points raised, but as far as being functional goes for the libfabric provider i'm interested in it works okay. so approving.
@amirshehataornl Do you have additional comments on the current revision? |
Been spending time trying to understand the NIC-GPU association problem and finally am beginning to untangle it. Part of the confusion stems from the large changes in package composition over the last 10 years, which is accelerating as we speak. For example, the topology shown above by @wenduwan doesn't exist any more - stopped being shipped more than a decade ago. Instead, we have far more complex arrangements with multiple distinct PCI busses being attached to various points in each package. I need to do a code audit and some testing, but I believe the PMIx distance calculation remains correct. However, it provides the distance between a given process and the devices on the node - it says nothing at all about device-to-device relationships. For the case of GDR, you want to select a NIC (or GPU) that has a specific relationship to a GPU (or NIC) - that cannot be done on the basis of distance. In other words, for the GDR case, you want to get the distance to NICs that meet a specific condition (conjoined to one or more specific GPUs). You don't want to consider NICs that fail to meet that condition. Once the distances are returned, you would then select the one with the least distance. If no NIC meets the condition, then you get nothing back. You could then remove the condition and just take the NIC that is closest to you, noting that GDR is unavailable. Ditto for the reverse problem of getting distance to GPUs that are conjoined to specific NICs. This shouldn't be too hard to provide within the existing APIs. Will try to work on it over the next week or two. |
I agree with @rhc54 's idea that we should separate 2 concerns in GDR case:
This is a good practice in general and I think we should follow this to implement GPU aware device selection. |
I'm already working on a PR that does the approach I mentioned above. Basically, the logic @rhc54 mentioned will be integrated under the GDR case. I'll push the PR once I have it tested and working. We have some NVIDIA machines local here, so I should be able to verify the GDR case, once it's available. |
I will test the patch on the new p5 platform to be bullet proof - it has extra NICs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small nits. I can approve the logic and implementation of the selection, but I don't have enough background to ensure there aren't hwloc subtleties we are missing.
opal/mca/common/ofi/common_ofi.c
Outdated
static int get_provider_nic_pci(struct fi_info *provider, struct fi_pci_attr *pci) | ||
{ | ||
if (NULL != provider->nic && NULL != provider->nic->bus_attr | ||
&& provider->nic->bus_attr->bus_type == FI_BUS_PCI) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ordering nit: FI_BUS_PCI == provider->
.... to be consistent with NULL checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks will address
opal/mca/common/ofi/common_ofi.c
Outdated
opal_output_verbose(1, opal_common_ofi.output, "Provider does not have PCI device: %s", | ||
provider->domain_attr->name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this log message is confusing. Reading just the error it's not clear if we care it has a PCI device, and it looks like we are searching for a PCI device named "efa" or something.
I suggest it should be more like: "Cannot determine device distance: Provider %s does not have PCI device."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool will rephrase.
opal/mca/common/ofi/common_ofi.c
Outdated
* @return An array of device distances which are nearest this thread | ||
* or NULL if we fail to get the distances. In this case we will just | ||
* revert to round robin. | ||
* @param[in] topoloy hwloc topolocy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typos: topoloy and topolocy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks will fix.
Could you perhaps send along the topology for that box? I'm always looking for new test topologies. |
Could I get the actual xml file so I can use it for testing? |
Ah got it. Does this one work for you? |
Yep - thx! |
This patch refactors the OFI NIC selection logic. It foremost improves the NIC search algorithm. Instead of searching for the closest NICs on the system, this patch directly compares the distances of the given providers and selects the nearest NIC. This change also makes it explicit that if the process is unbound, or the distance cannot be reliably calculated, a provider will be selected in round-robin fashion. Signed-off-by: Wenduo Wang <[email protected]>
55c51ad
to
f5f3b93
Compare
|
||
provider_rank = rank % num_nearest; | ||
num_nearest = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed a little nuance between num_nearest++
vs --num_nearest
.
--num_nearest
will result in NIC-3
being selected before NIC-2
, i.e. NIC-3
, NIC-2
, NIC-1
, NIC-0
.
This is counter intuitive - therefore I inverted the order so that NIC-0
is selected before NIC-1
, NIC-2
, NIC-3
. Just a little nicer to human brain.
@lrbison I tested the change on p5 and made an adjustment to nic selection order. Please take a second look when you get a chance. |
Minor changes from before, still looks good to me. |
@amirshehataornl Could you give it another look? I addressed additional comments after your approval. |
@lrbison @hppritcha Haven't heard from Amir for a while. Good to merge? |
fine with me but i'd prefer to get @bwbarrett opinion at this point. |
@bwbarrett Could you PTAL? |
@hppritcha @lrbison Shall we merge the change? |
This patch refactors the OFI NIC selection logic. It foremost improves
the NIC search algorithm. Instead of searching for the closest NICs on
the system, this patch directly compares the distances of the given
providers and selects the nearest NIC.
This change also makes it explicit that if the process is unbound, or
the distance cannot be reliably calculated, a provider will be selected
in round-robin fashion.