
Meeting Minutes 2018 09

Geoffrey Paulsen edited this page Oct 17, 2018 · 13 revisions

Open MPI fall/winter 2018 developer meeting

(yes, the "2018 09" title on this wiki page is a bit of a lie -- cope)

  • 9am, Tuesday, Oct 16, 2018 - noonish Thursday, Oct 18, 2018.
  • Cisco buildings 2 and 3 (right next to each other), San Jose, CA, USA.
    • Tuesday: Cisco building 2 (Google Maps link)
      • NOTE 1: The Tuesday meeting is immediately after the weekly Webex. You're welcome to show up after 7:30am US Pacific to be in the Cisco conference room for the weekly Webex.
      • NOTE 2: There is no Lobby Ambassador in Cisco building 2. You need to iMessage or SMS text Jeff Squyres, and I'll come escort you to the meeting room.
    • Wednesday, Thursday: Cisco building 3 (Google Maps link)
      • There is a Lobby Ambassador in Cisco building 3; they will alert me when you arrive (but iMessaging or SMS texting me wouldn't hurt, either).

Attendees

Please put your name on this list if you plan to attend any/all of the meeting so that Jeff can register you for a Cisco guest badge+wifi.

  1. Jeff Squyres, Cisco
  2. Andrew Friedley, Intel
  3. Geoff Paulsen, IBM
  4. Howard Pritchard, LANL
  5. Edgar Gabriel, UH
  6. Shinji Sumimoto, Fujitsu
  7. Thomas Naughton, ORNL
  8. Matias Cabral, Intel (16th only)
  9. Neil Spruit, Intel (16th only)
  10. Brian Barrett, Amazon (16th only)
  11. Akshay Venkatesh, NVIDIA
  12. Xin Zhao, Mellanox
  13. Artem Polyakov, Mellanox
  14. George Bosilca, UTK
  15. Arm Thananon Patinyasakdikul, UTK

Agenda items

Tues Oct 16 (to accommodate limited availability)

OFI (Libfabric):

  • Intel presenting some slides on OFI MTL
    • (pronounced Oh_Ef_Eye, or Ooh_Fee, or Oh_Fie)
    • Added support for remote CQ data.
      • Not all providers support remote CQ data.
    • Affects MPI_Send/Isend/Irecv/Recv/Improbe/Iprobe
      • Now conditionals in there.
    • Scalable endpoints support in OFI MTL
    • Registering specialized communication functions based on provider capabilities
    • m4 generated C code to avoid code duplication?
    • Discussion: OFI components to set their priority based on the provider found.
    • OFI Common module creation.
  • Conversation:
    • Libfabric has to improve: the fact that not all libfabric providers support all features is bad for users of the interface.
    • Libfabric should better set expectations of what's needed.
    • This is still a useful feature, but the libfabric API needs to be better.
  • slide 5 and 6:
    • Possible Solutions to generate provider specific implementation.
    • CPP macros multiply quickly with many arguments; very fragile.
    • Expand m4 configuration step.
  • Conversation:
    • Today, we don't require m4 to BUILD Open MPI, just to make dist.
    • Version of m4 is also an issue....
    • Suggestion, don't generate this C code at make time, but generate ALL C files at Make dist time, and only compile some of them at build time.
    • Make dist already requires perl and python.
    • No fallback to build slowly.
    • Will make ALL of the providers at make dist time, and will set up function pointers at runtime.
    • Jeff jokingly suggested that they could implement it with Patcher to get rid of function pointer indirection once the provider is known.
  • Providers would implement these pieces themselves.
  • TCP / Sockets reference may need an owner.
  • Does it make sense to NOW change OFI from an MTL to a PML?
    • Mellanox first did MTL, but then switched to PML.
    • Intel made CM better, so difference is now much smaller.
    • Intel estimates about a 100 ns penalty from an MTL rather than a PML.
      • Benchmarking this is important, because much of CM layer would just need to push some of this functionality down to PML.
    • Most are okay if it's an MTL or a PML, but don't want extra work.
    • A resource issue now.

Scalable Endpoints - OFI SEPs

  • Expose hardware resources through libfabric APIs.
  • Thread -> ctx assignment.
  • This feature enables exclusive transmit/receive contexts per MPI Comm.
  • OFI MTL by communicator ID.
    • Ongoing work to add "on first access"
  • OFI BTL on first access (or round robin)
    • TLS used to save OFI context used by specific thread (round robin without TLS)
    • Forces its own progress once an empirically set threshold is exceeded.
  • Both of the above only make sense with a multi-threaded opal_progress
    • ARM has slides and will talk about this later today.
  • Thread grouping
  • OFI MTL Benchmarked with IMB-MT (available with Intel MPI 2019 Tech Preview)
  • OFI BTL benchmarked different usage with Arm's "Pairwise RMA" benchmark on thannon's GitHub fork
    • Each thread running on its own MPI communicator.
    • Open MPI has lots of thread-unsafe components. Could flag at runtime whether opal_progress is thread safe or not (and either funnel or not).
    • ARM will discuss.
  • What are the assumptions about endpoints? Are they all connectionless? Any endpoint can talk to any endpoint.
  • Only broadcast one endpoint, but address contexts relative to it
  • Problems (slide 14)
    • Multiple OFI components "may" inefficiently use resources.
    • Today without addressing Issue 5599, BTLs are always initialized
    • Suggestion: the Portals library does this; look at vector addressing.
  • Discussions about packaging: how to get instances to share common data structures. OFI Common module.
    • Different policies for resource sharing: EP, CQ, AV.
      • policies will be "share endpoints", or each component create their own.
  • OFI Components Dynamic Priority (slide 16) Issue 5794 - MTL will discard default currently known "slow" providers.

TCP on master right now.

  • Cisco and Amazon are seeing tests fail with "socket closed"
  • A patch about a month ago started treating a socket close as a hard error and aborting the job.
  • Just need to apply one of two patches.
    1. Back off the hard error in the shutdown condition.
    2. Just revert the patch that broke everything.
  • As a community we should do one or the other, but this just hides the real problem.
  • TCP patch just exposed a bigger problem.
    • TCP BTL doesn't have a connection and handshake.
    • Second, the PML calls abort, but during finalize perhaps it shouldn't abort.
    • The PML should know about its BTLs going away.
    • No longer a fence in Finalize.
  • We all agree that we should apply the same logic everywhere
    • If a commit breaks master, we should revert it, or disable it
    • Doesn't shut the door to better fix / redesign later.

Lunch

Multi-threaded OPAL Progress

  • PR 5241: Add MCA param for multithread opal_progress() (George, Arm)
  • Arm presented some graphs about ideas around multi-threaded opal_progress()
  • Old design still capped at single thread.
    • Luck-based performance (thread scheduling is nondeterministic)
    • Current design performs well in oversubscribed case.
  • The proposed design is to move thread protection down to the component level.
    • let every thread call opal_progress() to get the benefit of threads.
    • can have some thread go through progress, and can have some in thread pool to do tasks.
  • At 1024 bytes: +35% injection rate (shines at small messages; why not 1 byte?)
  • What can be taskified?
    • Matching
    • Pack, unpack
    • Anything that does not need to be in critical section.
  • Downsides
    • Task creation overhead.
    • If only one thread is waiting, there's more overhead/latency
  • What we need from components:
    • They have to be thread safe (or disqualify themselves under THREAD_MULTIPLE)
  • Components should not hold a blocking lock in component_progress()
    • For those not yet thread safe, just use a trylock instead.
  • Out of sequence impact
    • receiving
    • MPI has a way for the app to tell MPI not to enforce ordering
      • Once you go multi-threaded, most of the time is then spent ordering.
      • can get 100% speed up if app doesn't need ordering (app handles tags themselves).
      • RDMA (bigger than FIFO), stalls pipelines, really bad.
  • MADNESS performance (C++, bad mem management)
    • 5 times slower on openib btl
    • The RDMA path loses a lot of performance because of out-of-order sequence numbers.
  • We need to better educate users to use multiple communicators.
    • Is this because of communicator lock contention or separate match spaces?
      • Both.
  • What do we do today?
    • BTL modules need a lock to protect their progress.
  • All agree that mt_opal_progress makes sense, in hopes of getting a better injection rate.
    • Looking for proposal on 'core' locking around btl progress functions.
  • What about the single threaded?
    • Single-threaded will be affected, but should just be a single thread passing through a trylock (~0 cost)
    • With this functionality, single thread can easily have multiple progress threads.
  • WORK Phases:
    1. MCA param for # of threads allowed in opal_progress(), default: 1
      • Also means changing opal_progress() to be multi-threaded.
      • Arm will PR
    2. Arm provides recommendation(s)
    3. Maintainers update components with a giant trylock()
      • Any component that registers a progress function needs this trylock()
      • libnbc as well? (libnbc already sort of does this)
      • Everyone should do this; should be very simple.
    4. Maintainers update components to make use of multiple contexts.
    5. Argue about changing the MCA param default to > 1
  • End Goal:
    • If components want to take advantage of multi-threaded progress, they can do a bit more work (3 lines)
    • Make

5599 - BML initialization

  • r2 / BTLs are initialized even when they are not used (Jeff)
    • https://github.com/open-mpi/ompi/issues/5599
    • Conclusion, there's not a great solution for NOT initializing the BTLs because of one-sided.
      • If we did this, we'd need an answer for which BTL to use when we get there. Right now only Portals and UCX have an answer for that.
    • There's also no straightforward way for multiple components to share one OFI endpoint across components.
      • Ralph says we DO have a way to Allow the OFI MTL component to let the OFI BTL component know the other is using a certain resource. Could set a PMIx event that the other could catch.
    • Either extend libfabric / libpsm2, or write a common component between ompi and opal.
  • RC tomorrow
  • Release Monday!
  • Nathan just filed issue 5889 as a blocker.

TCP bric-a-brac:

  • Discuss TCP multilink support. What is possible, what we want to do and how can we do it.
    • Amazon signed up to do this, but might not happen until Q1 2019
  • Discuss TCP multiple IP on the same interface. What we want to do, and how we plan to do it.
    • Right now if you have both IPv4 AND IPv6, only publish the IPv6 if they have IPv6 enabled on all nodes.
    • Everyone's happy with this current behavior.
  • TCP BTL progress thread. Does 1 IP interface vs. >1 IP interface matter?
    • Discussed this morning.

C compiler discussion

  • ISSUE: Vader hit some bugs where compilers were splitting a 32-bit write into two different 16-bit writes.
  • See the Linux kernel documentation on the evil things compilers can do.
  • Linux has some macros: WRITE_ONCE, READ_ONCE, ACCESS_ONCE. But the macros only work for gcc (> v4.1), llvm, and/or intelcc
    • Not sure if this is just a temporary measure until we have C11-compliant compilers.
  • Should we limit the number of C compilers that can be used to compile the OMPI core (e.g., limit the amount of assembly/atomic stuff we need to support).
    • E.g., PGI doesn't give us the guarantees we need
    • Probably need to add some extra wrapper glue: e.g., compile OMPI core with C compiler X and use C compiler Y in mpicc.
      • Are there any implications for Fortran? Probably not, but Jeff worries that there may be some assumption(s) about LDFLAGS/LIBS (and/or other things?) such that: "if it works for the C compiler, it works for the Fortran compiler".
  • Decided to require compilers that can correctly guarantee the WRITE_ONCE, READ_ONCE, and ACCESS_ONCE macros.
  • xlC does not guarantee these, and so can't compile the Open MPI core.
  • Brian and Nathan will sort out who will do this work on master for next release from master.
  • Will clearly state which dirs are compiled as the 'core', and what a compiler needs to do to compile the 'core'.
  • Essentially, configury will ask for a 'core' compiler and a 'main' compiler.

PMIx as "first class" citizen?

  • Shall we remove the OPAL pmix framework and directly call PMIx functions?

    • Require all with non-PMIx environments to provide a plugin that implements PMIx functions with their non-PMIx library
    • In other words, invert the current approach that abstracted all PMIx-related interfaces.
  • Why does Open MPI need to know about pmix server code?

    • Only reason it's here, is because of the OPAL abstraction layer.
    • Coming into the abstraction layer, because you're coming in with Opal types, and need to convert to pmix APIs.
  • Howard thought we'd already decided to just call PMIx directly.

  • And that this would be part of the same issue.

  • In the absence of ORTE, users would have to download, build, and install PRTE, and do a LAMBOOT-style startup.

  • LAMBOOT could be buried under an mpirun-like wrapper.

  • Engineering efforts?

    1. Making mpirun hide PRTE - Ralph says it's pretty trivial (he has something similar in PRTE)
    2. Would have a separate "project" for this, and there are a number of ways to do this.
  • It is another piece with a different release schedule, and with different release goals. May be broader than what Open MPI needs.

  • If PMIx adds a new

  • New features coming down the road:

    • Groups (part of sessions)
    • Networking support in PMIx, but we don't have a way to take advantage of it in OMPI.
    • Containers.
  • About 4 or 5 things in next 6 months.

  • Ralph is giving a talk at Supercomputing about users' applications using PMIx directly. More users will want to use this.

  • Is there a way for a user who linked against PMIx to intercept our MODEX?

  • MODEX is special, but nothing stops a user from also getting a callback on certain PMIx events.

  • If we get rid of opal wrappers that call PMIx, does that concern us for future maintainability?

    • No, that was done back when we had to support multiple things, PMI1 and PMI2, and SLURM, etc.
  • If we did this, we (the Open MPI community) would expect PMIx to behave as an interface, like we do with the MPI interface.

    • PMIx has made that promise, and this is why there is now a PMIx standard.
  • Following shared library versioning and API promises, PMIx is then in the same boat as hwloc, libevent, etc.

  • Why not make hwloc and libevent first-class citizens?

  • Talk about getting rid of libevent.

    • Next time we do the work for libevent, we might consider moving it UP to a top-level third-party directory.
    • Take a look at opal hwloc in PRTE.
      • We don't have an hwloc object; it's just a name translation.
      • We just call opal_hwloc...
      • One part we have to retain: we define binding policies in it.
      • PRTE doesn't have embedded hwloc. All the base functions are pulled into opal_hwloc.
    • Ralph also has an opal event directory in PRTE.
    • In PRTE there are no abstractions.
    • Oh wait, there IS an opal_pmix in PRTE, since there are some conversions in there.
      • Howard is taking a look at this.
  • Discussion expanded from just PMIx to also direct calling hwloc and libevent.

    • ORTE pushes events into libevent as a way to sequence them. On a very large system, if you don't do things correctly, you can see this.
      • PMIx has been getting out of that habit and so everything's okay.
  • hwloc not much work.

  • The PMIx side would need to do translation. (We can't get rid of error-code translation; there's no way we can line up those error codes.)

    • pmix_info_t's: it would be nice to get rid of doing these translations.
    • Easy for Ralph to update, since he's already done this in PRTE.
  • Difficulty - How do you deal with differences in version levels?

    • Easy for compile time differences. It either builds or doesn't.
    • Runtime differences in PMIx support is more difficult.
    • Example: build OMPI against PMIx v4.0, but then RUN with v3.0. The linker should fail at runtime, if .so versioning is done correctly.
  • What do we do with SLURM PMIx (16.05 first), and ALPS PMI?

    • Is there one for ALPS? No, but there's no reason they couldn't implement PMIx. They will support the launch mechanism at least.
  • If we do this for OMPI v5.0 and you still want to run under an older SLURM, the user COULD launch PRTE_BOOT in their SLURM allocation.

PROPOSAL:

  1. Remove ORTE
  2. Remove RTE layer?
  3. Add PRTE
  4. Modify mpirun / prterun - see if there's an existing PMIx server, if not auto PRTE_BOOT the PRTE daemons, and then launch.
  • PRTE would HAVE to go to formal release
  • Are there any components in ORTE that are not in PRTE?
  • QUESTION: could mpirun act like a wrapper around srun?
    • No, but we could look at that.
    • Does SLURM support spawn?
    • If it did, could just call PMIx_Spawn, and let slurm launch the PMIx daemon.
  • This will destroy years of PowerPoint slides.
  • OMPI layer will just call PMIx.
  • Not THAT much work, because we don't have to change the ORTE side.
  • Howard is going to remove bfo component (removes a trouble spot)
  • Howard (ECP) in January timeframe.
  • Are we going to redistribute PRTE?
    • Yes, it's got the same license as Open MPI.
  • Ralph moves hwloc and libevent up.
    • Some things need translation
    • But Don't do name shifting.
    • Want to see where the glue goes.
  • It will still check for an external version first, and then build the internal pieces that don't exist.

=====

To Be Scheduled

  • 5.0.x roadmap
  • PMIx Roadmap
    • Review of v3 and v4 features
    • Outline changes in OMPI required to support them
    • Outline changes for minimizing footprint (modex pointers instead of copies)
    • Decide which features OMPI wants to use
  • ORTE support model
  • Should we publish Open MPI release tarballs to https://github.com/open-mpi/ompi/releases?
  • Mail service (mailman, etc.) discussion - here are the lists we could consolidate down to:
    • OMPI core
    • OMPI devel/packagers
    • OMPI users/announce
    • OMPI commits
    • HWLOC commits
    • HWLOC devel/users/announce
    • MTT users/devel
    • Do we want to move to a commercial hosting site? Would cost about $21/month, or about $250/year
    • Additionally: The Mail Archive is going through some changes. Unclear yet as to whether this will impact us or not
    • As of 4 Sep 2018, Open MPI has $575 in our account at SPI.
  • public ompi-tests repository for easier sharing of testsuites among collaborators (Edgar)
  • Remove orte-dvm and redirect users to PRRTE? (Ralph)
  • Ralph+Jeff: discuss PMIx compatibility issues and how to communicate them
  • Discuss memory utilization/scalability (ThomasN)
  • Debugger transition from MPIR to PMIx
    • How to orchestrate it?
  • ABI-changing commit on master (after v4.0.x branch) which will affect future v5.0.x branch: https://github.com/open-mpi/ompi/commit/11ab621555876e3f116d65f954a6fe184ff9d522.
    • Do we want to keep it? It's a minor update / could easily be deferred.
  • Nathan/Brian: Vader bug cleanups
    • Want to strengthen the recent vader fixes to be fully bulletproof
  • Need vendors to reply to their issues on the github issue tracker
  • Mellanox/Xin: Performance optimization on OMPI/OSC/UCX multithreading
  • Fujitsu's status
    • Fujitsu MPI for Post-K Computer
    • Development Status in Fujitsu
    • QA Activity in Fujitsu
  • openib: persistent error reported by multiple users