Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC: add support for hwloc v2 #3302

Merged
merged 4 commits into from
Jul 20, 2017
Merged

Conversation

ggouaillardet
Copy link
Contributor

No description provided.

@ggouaillardet
Copy link
Contributor Author

@bgoglin @rhc54 FYI it still does note compile with hwloc v2, and the things related to distance management is either wrong or missing

@bgoglin
Copy link
Contributor

bgoglin commented Apr 7, 2017

It compiles on my hwloc v2 here and it looks like you have ported to the new v2 distances API since your comment?

On the IO side, there are more changes. The IO_DEVICE topology flag is replaced with

hwloc_topology_set_io_types_filter(topology, HWLOC_TYPE_FILTER_KEEP_IMPORTANT);

(this enables PCI bridges, but we can filter more precisely if you want PCI and I/O devices without bridges).

Then, we need to check whether there is any place where you look for I/O objects in the children list. They moved to the dedicated I/O children list.

@ggouaillardet
Copy link
Contributor Author

@bgoglin regarding the new distances API, i made

  • something that compiles in hwloc
  • ifdef'ed out the code in btl/openib
    and i did not try yet to run even a simple MPI program
    i will keep improving this, thanks for the pointer to new IO devices API

@rhc54
Copy link
Contributor

rhc54 commented Apr 7, 2017

Something we are going to have to figure out is: how do we bring this in while retaining support for v1.x versions of HWLOC? We can't just switch over as many systems will not upgrade for quite some time.

If we ship an embedded version of v2.0, then we'll need an "ext1x" component that can seamlessly handle all the differences.

@ggouaillardet
Copy link
Contributor Author

well, on the bright side, the hwloc/external component is virtually unmodified, and can handle both hwloc v1 and v2 APIs
on the other side, there is a bunch of #ifdef HWLOC_API?_VERSION < 0x20000 in various places.
the way cache objects are handled in both API are pretty different and (at least to me) make it hard to abstract.
currently, this PR does not compile with an external hwloc based on the master branch (e.g v2.x),
and my targets are

  1. build successfully
  2. run sucessfully
  3. make it elegant and easy to maintain

@hppritcha
Copy link
Member

Is there a brief description somewhere of the pluses for switching to the 2.0 API?

@bgoglin
Copy link
Contributor

bgoglin commented Apr 7, 2017

Well, given that several changes were requested by OMPI devs, I hope you will use it in the end :)

There's no brief description yet. Major improvements are:

  1. L1CACHE, L2CACHE etc instead of a single CACHE type + a level attribute, so that you don't need cache-specific subcases everywhere. This was requested by @rhc54 and/or @jsquyres. It will simplify the OMPI code, but the transition isn't very easy. I wonder if you could convert types into depth earlier in the code path so that the CACHE subcase is only in the very first OMPI layers.
  2. Clean support for heterogeneous memories (KNL MCDRAM+DDR flat mode is currently strange because the MCDRAM is exposed without any local CPU). Still not in hwloc git master. Impact on OMPI is unclear.
  3. The distance API was significantly reworked for multiple poor design reasons (use of floats, internal management was messy, API not flexible enough, etc). Impact on OMPI should be OK (a single place #ifdef ?).

@ggouaillardet
Copy link
Contributor Author

i just pushed my latest changes. the PR now compiles with both hwloc API v1 and v2

@hppritcha
Copy link
Member

Will this approach work with external hwloc being either 2.0 or older ones?

@ggouaillardet
Copy link
Contributor Author

@hppritcha yes

@rhc54
Copy link
Contributor

rhc54 commented Jun 25, 2017

@ggouaillardet I need to move this along in prep for the HWLOC shared memory support, so I'll branch off of this point (if you don't mind) and submit something for your review.

@ggouaillardet
Copy link
Contributor Author

@rhc54 sure, please go ahead.
@bgoglin i guess we should all decide on how to move forward with this

  • do we simply stop embedding hwloc ?
  • do we embed hwloc v2 but as a truly standalone component (e.g. with its own autogen.sh and configure.ac
    if we go that way, do we embed the full hwloc or the embedded version (which has less files, but requires configury: enable autogen.sh from an embedded hwloc dist tarball hwloc#235 so autogen.sh can be ran from an ompi tarball) ?
    iirc, i did not finalize this PR, especially the part related to hardware/topology distances
    (which was moved from float to int)

@rhc54
Copy link
Contributor

rhc54 commented Jun 26, 2017

I'll raise the "do we stop embedding hwloc" question on the telecon this week. IIRC, we did hash that around a bit, but I don't think we ever reached conclusion - I think hwloc wasn't included in some really old distros (e.g., RHEL 5?). Of course, that shouldn't stop somebody from installing it.

@ggouaillardet
Copy link
Contributor Author

do you expect hwloc v2 in the ompi v3 branch ? or in the future v4 branch ?
iirc, we concluded we need to embed hwloc for now.
that being said, if the target is Open MPI 4, i guess we can anticipate hwloc will be virtually available everywhere by then.

@rhc54
Copy link
Contributor

rhc54 commented Jun 26, 2017

strictly v4, I think - I can't imagine porting it to v3

@rhc54
Copy link
Contributor

rhc54 commented Jun 27, 2017

I added the embedding question to the topic list for the July meeting

@bgoglin
Copy link
Contributor

bgoglin commented Jul 17, 2017

@ggouaillardet Can you clarify why allowed_cpuset and the WHOLE_SYSTEM flag isn't used anymore? I always wondered why it was used, so what's the reason for removing it in this hwloc 2.0 update and not earlier?

I am actually thinking of removing the allowed_cpuset field from objects in 2.0 and just have a single topology-wide allowed_cpuset. People using the WHOLE_SYSTEM flag would have to look at and(obj->cpuset,topology->allowed_cpuset) for knowing whether obj is allowed. That's part of larger shrinking that will more likely remove obj->complete_cpuset first.

@rhc54
Copy link
Contributor

rhc54 commented Jul 17, 2017

IIRC, the flag was used because we were passing in an xml string that represented the entire system, not just the portion we had been allocated. It has been a long time, though, so I might be mistaken.

@rhc54
Copy link
Contributor

rhc54 commented Jul 17, 2017

FWIW: I'm not convinced it should have been removed. Regardless, you simplification removes any need for it.

@ggouaillardet
Copy link
Contributor Author

@bgoglin i checked the code and basically we currently use WHOLE_SYSTEM and then manually filter out the unallowed cpusets with allowed_cpuset everywhere.
so not using WHOLE_SYSTEM remove the need to filter out some cpus, and hence simplify the code.

but as pointed by @rhc54, i did not check what happens if we pass a string/file with some cpus that are not allowed.

anyway, i will fix the conflicts from now

@rhc54
Copy link
Contributor

rhc54 commented Jul 19, 2017

Wait - what? I already fixed the conflicts and have this pretty much ready to go. What are you planning to do now?

@ggouaillardet
Copy link
Contributor Author

so we did the job twice ...
you previously said you are working in your own branch, but i have no visibility on it.
can you then please make a new PR, add a reference and then close this one.
btw, at the devel meeting, was there a consensus about embedding hwloc or not ?

@rhc54
Copy link
Contributor

rhc54 commented Jul 19, 2017

My apologies - it is pushed to my OMPI repo. I just didn't file a PR as I was waiting for the shmem changes. However, I can add that separately, so why don't you go ahead and just commit this?

We decided to continue embedding for now, perhaps changing the selection logic to take a suitable external version (if found in the standard locations) over the internal one. For now, I just want to get 2.0 working along with the HWLOC shmem support, so let's focus on that.

@ggouaillardet
Copy link
Contributor Author

no problem, i will update hwloc2x to the latest master and merge this PR tomorrow

@bgoglin
Copy link
Contributor

bgoglin commented Jul 19, 2017

The hwloc API still isn't ready yet, don't merge this PR yet. Hopefully by the end of summer.

@rhc54
Copy link
Contributor

rhc54 commented Jul 19, 2017

@bgoglin If it passes our MTT tests, then why not merge it? There are a lot of HWLOC APIs, and we only use a handful, so there is unlikely to be any conflict. If we do get some later on, we can always update at that time.

Merging this sooner would allow me to do the shmem integration, which is why I'm asking.

@bgoglin
Copy link
Contributor

bgoglin commented Jul 20, 2017

My concern was that the last missing API change is quite intrusive (it moves NUMA nodes out of the main tree, they will now be special "memory" children of objects).

If you use the basic API like get_obj_by_type or depth (which seems to be case from what I see with a quick grep), no problem.

If you use non-basic APIs (maybe like get_obj_inside_cpuset_by_type), I am not 100% sure yet if I'll be able to make them work transparently for NUMA nodes.

If you manually traverse the tree using pointers, you'll need to update the code. Main example: If you ever look for your local NUMA node by walking up parent object pointers until you find a NUMA node object, you won't find it anymore (you'll need to look at parent "memory" children instead of parent).

If that's fine, go ahead and merge it.

checks related to offline and disallowed elements

Signed-off-by: Gilles Gouaillardet <[email protected]>
Signed-off-by: Gilles Gouaillardet <[email protected]>
internal hwloc 2x is used with --with-hwloc=future

Signed-off-by: Gilles Gouaillardet <[email protected]>
@rhc54
Copy link
Contributor

rhc54 commented Jul 20, 2017

@bgoglin Thanks for the explanation. I don't see a problem with merging. If the API change impacts us, we'll just update when we pick it up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants