Meeting Minutes 2018 09
(yes, the "2018 09" title on this wiki page is a bit of a lie -- cope)
- 9am, Tuesday, Oct 16, 2018 - noonish Thursday, Oct 18, 2018.
- Cisco buildings 2 and 3 (right next to each other), San Jose, CA, USA.
- Tuesday: Cisco building 2 (Google Maps link)
- NOTE 1: The Tuesday meeting is immediately after the weekly Webex. You're welcome to show up after 7:30am US Pacific to be in the Cisco conference room for the weekly Webex.
- NOTE 2: There is no Lobby Ambassador in Cisco building 2. You need to iMessage or SMS text Jeff Squyres, and I'll come escort you to the meeting room.
- Wednesday, Thursday: Cisco building 3 (Google Maps link)
- There is a Lobby Ambassador in Cisco building 3; they will alert me when you arrive (but iMessaging or SMS texting me wouldn't hurt, either).
Please put your name on this list if you plan to attend any/all of the meeting so that Jeff can register you for a Cisco guest badge+wifi.
- Jeff Squyres, Cisco
- Andrew Friedley, Intel
- Geoff Paulsen, IBM
- Howard Pritchard, LANL
- Edgar Gabriel, UH
- Shinji Sumimoto, Fujitsu
- Thomas Naughton, ORNL
- Matias Cabral, Intel (16th only)
- Neil Spruit, Intel (16th only)
- Brian Barrett, Amazon (16th only)
- Akshay Venkatesh, NVIDIA
- Xin Zhao, Mellanox
- Artem Polyakov, Mellanox
- George Bosilca, UTK
- Arm Thananon Patinyasakdikul, UTK
- Intel presenting some slides on OFI MTL
- (pronounced Oh_Ef_Eye, or Ooh_Fee, or Oh_Fie)
- Added support for remote CQ data.
- Not all providers support remote CQ data.
- Affects MPI_Send/Isend/Irecv/Recv/Improbe/Iprobe
- There are now conditionals in there.
- Scalable endpoints support in OFI MTL
- Registering specialized communication functions based on provider capabilities
- m4 generated C code to avoid code duplication?
- Discussion: OFI components to set their priority based on the provider found.
- OFI Common module creation.
- Conversation:
- Libfabric has to improve: the fact that not all libfabric providers support all features is bad for users of the libfabric interface.
- Libfabric should better set expectations of what's needed.
- This is still a useful feature, but the libfabric API needs to be better.
- slide 5 and 6:
- Possible Solutions to generate provider specific implementation.
- CPP macros multiply quickly with many arguments. Very fragile.
- Expand m4 configuration step.
- Conversation:
- Today, don't require m4 to BUILD Open MPI, just to make dist.
- Version of m4 is also an issue....
- Suggestion, don't generate this C code at make time, but generate ALL C files at Make dist time, and only compile some of them at build time.
- Make dist time also has perl and python.
- No fallback to build slowly.
- Will make ALL of the providers at make dist time, and will setup function pointers at runtime.
- Jeff jokingly suggested that they could implement it with Patcher to get rid of function pointer indirection once the provider is known.
- Providers would implement these pieces themselves.
- TCP / Sockets reference may need an owner.
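The "generate all variants, then set up function pointers at runtime" idea discussed above can be sketched roughly as follows. This is a minimal illustration, not the OFI MTL code: every `ex_` name is hypothetical, and the two send functions only report which path ran so the selection is observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which specialized path handled the send (only here to make the sketch observable). */
typedef enum { EX_PATH_CQ_DATA, EX_PATH_NO_CQ_DATA } ex_path_t;

typedef ex_path_t (*ex_send_fn_t)(const void *buf, size_t len);

/* Hypothetical specialization for providers that support remote CQ data. */
static ex_path_t ex_send_cq_data(const void *buf, size_t len)
{
    (void)buf; (void)len;
    return EX_PATH_CQ_DATA;
}

/* Hypothetical specialization for providers that do not. */
static ex_path_t ex_send_no_cq_data(const void *buf, size_t len)
{
    (void)buf; (void)len;
    return EX_PATH_NO_CQ_DATA;
}

/* Chosen once at init time, so the hot path carries no capability conditionals. */
static ex_send_fn_t ex_send = ex_send_no_cq_data;

void ex_select_send(bool provider_has_remote_cq_data)
{
    ex_send = provider_has_remote_cq_data ? ex_send_cq_data : ex_send_no_cq_data;
}

ex_path_t ex_do_send(const void *buf, size_t len)
{
    return ex_send(buf, len);
}
```

The cost that remains is one indirect call per send, which is the indirection the Patcher joke above was about removing.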
- Does it make sense to NOW change OFI from an MTL to a PML?
- Mellanox first did MTL, but then switched to PML.
- Intel made CM better, so difference is now much smaller.
- Intel thinks about 100 ns penalty from MTL rather than PML.
- Benchmarking this is important, because much of CM layer would just need to push some of this functionality down to PML.
- Most are okay if it's an MTL or a PML, but don't want extra work.
- A resource issue now.
- Expose hardware resources through libfabric APIs.
- Thread -> ctx assignment.
- This feature enables exclusive transmit/receive contexts per MPI Comm.
- Assume each thread has a paired thread for each rank in Comm.
- Contexts are shared once the hardware limit is reached.
- Code currently in github branch: https://github.com/aravindksg/ompi/tree/ofi_sep
- OFI MTL by communicator ID.
- Ongoing work to add "on first access"
- OFI BTL on first access (or round robin)
- TLS used to save OFI context used by specific thread (round robin without TLS)
- Forces its own progress once an empirically set threshold is exceeded.
- Both of the above only make sense with multi-threaded opal_progress
- ARM has slides and will talk about this later today.
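A rough sketch of the "on first access" assignment with TLS, wrapping around round-robin once the contexts run out. This is an assumption-laden illustration rather than the code in the branch above: `EX_NUM_CONTEXTS` and the `ex_` names are hypothetical, and it relies on C11 `_Thread_local` and `<stdatomic.h>`.

```c
#include <assert.h>
#include <stdatomic.h>

#define EX_NUM_CONTEXTS 4              /* hypothetical hardware limit */

static atomic_uint ex_next_ctx;        /* zero-initialized shared counter */
static _Thread_local int ex_my_ctx = -1;  /* per-thread cached context index */

/* Returns the context index for the calling thread. The first call from a
 * thread claims the next index; once all contexts are taken, indices wrap
 * around, so later threads share contexts with earlier ones. */
int ex_ctx_for_this_thread(void)
{
    if (ex_my_ctx < 0) {
        ex_my_ctx = (int)(atomic_fetch_add(&ex_next_ctx, 1) % EX_NUM_CONTEXTS);
    }
    return ex_my_ctx;
}
```

Without TLS, the same counter could be consulted on every call, which gives plain round-robin instead of a sticky per-thread assignment.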
- Thread grouping
- OFI MTL Benchmarked with IMB-MT (available with Intel MPI 2019 Tech Preview)
- OFI BTL Benchmarked different usage with ARM's "Pairwise RMA" benchmark on thannon's github fork
- Each thread running on its own MPI Communicator.
- Open MPI has lots of thread unsafe code components. Could flag at runtime if opal_progress is thread safe or not (and either funnel or not).
- ARM will discuss.
- What are the assumptions about endpoints? All are connectionless; any endpoint can talk to any endpoint.
- Only broadcast one endpoint, but address relative contexts
- Problems (slide 14)
- Multiple OFI components "may" inefficiently use resources.
- Today without addressing Issue 5599, BTLs are always initialized
- Suggestion: Portals library does this, look at vector addressing.
- Discussions about packaging, how to get instances to share common data structures. OFI Common module.
- Different policies for resource sharing: EP, CQ, AV.
- policies will be "share endpoints", or each component create their own.
- OFI Components Dynamic Priority (slide 16) Issue 5794 - MTL will discard default currently known "slow" providers.
- Cisco and Amazon are seeing tests fail with "socket closed"
- A patch about a month ago started treating a socket close as a hard error and aborting the job.
- Just need to apply one of two patches.
- One backs off the hard error in the shutdown condition.
- Just revert the patch that broke everything.
- As a community we should do one or the other. But this just hides the real problem.
- TCP patch just exposed a bigger problem.
- TCP BTL doesn't have a connection and handshake.
- Second PML calls abort, but during finalize perhaps it shouldn't abort.
- PML should know about its BTLs going away.
- No longer a fence in Finalize.
- We all agree that we should apply the same logic everywhere
- If a commit breaks master, we should revert it, or disable it
- Doesn't shut the door to better fix / redesign later.
- PR 5241: Add MCA param for multithread opal_progress() (George, Arm)
- Arm presented some graphs about ideas around multi-threaded opal_progress()
- Old design still capped at single thread.
- luck-based performance (threads are nondeterministic)
- Current design performs well in oversubscribed case.
- proposed design would be to move thread protection down to component level.
- let every thread call opal_progress() to get the benefit of threads.
- can have some thread go through progress, and can have some in thread pool to do tasks.
- 1024 bytes +35% injection rate (shines at small msg, why not 1byte)
- What can be taskified?
- Matching
- Pack, unpack
- Anything that does not need to be in critical section.
- Downsides
- Task creation overhead.
- If only one thread waiting, more overhead/latency
- What we need from components:
- They have to be thread safe (or disqualify with THREAD_MULTIPLE)
- Components should not use lock in component_progress()
- For those not safe, can just use trylock instead.
- Using lock
- Out of sequence impact
- receiving
- MPI has a way to tell MPI not to enforce ordering
- Once you go multi-threaded, most of the time is then spent ordering.
- can get 100% speed up if app doesn't need ordering (app handles tags themselves).
- RDMA (bigger than FIFO), stalls pipelines, really bad.
- MADNESS performance (C++, bad mem management)
- 5 times slower on openib btl
- RDMA path lose a lot of performance because of out of order seq.
- We need to better educate users to use multiple communicators.
- Is this because of Communicator lock contention or separate match space?
- Both.
- What do we do today?
- BTL modules need a lock to protect their progress.
- All agree that mt_opal_progress makes sense, in hopes to get better injection rate.
- Looking for proposal on 'core' locking around btl progress functions.
- What about the single threaded?
- Will affect, but should be a single thread through a trylock (~0)
- With this functionality, single thread can easily have multiple progress threads.
- WORK Phases:
- mca param # threads allowed in opal_progress, default: 1
- Also means changing opal_progress() to be multi-threaded.
- ARM will PR
- Arm provides recommendation(s)
- maintainers update components with a giant trylock().
- Any component that registers a progress function needs this trylock()
- libnbc as well? (libnbc already does this sort of)
- Everyone should do this, should be very simple.
- maintainers update components to make use of multiple contexts.
- Argue about default of mca param > 1
- End Goal:
- If components want to take advantage of multi-threaded progress, they can do a bit more work (3 lines)
- Make
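The "giant trylock()" step in the work phases above might look roughly like this for a component whose progress function is not thread safe. It is a sketch under assumptions: it uses a plain POSIX mutex rather than any OPAL locking primitive, and all `ex_` names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t ex_progress_lock = PTHREAD_MUTEX_INITIALIZER;
static int ex_events_handled = 0;

/* Stand-in for a component's existing, thread-unsafe progress body. */
static int ex_unsafe_progress_body(void)
{
    ex_events_handled++;
    return 1;   /* number of events completed */
}

/* Progress entry point: never blocks. If another thread is already inside,
 * report "no events" and let the caller move on to the next component. */
int ex_component_progress(void)
{
    if (0 != pthread_mutex_trylock(&ex_progress_lock)) {
        return 0;
    }
    int completed = ex_unsafe_progress_body();
    pthread_mutex_unlock(&ex_progress_lock);
    return completed;
}
```

In the single-threaded case the trylock always succeeds, so the added cost is one uncontended lock/unlock pair per call.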
- r2 / BTLs are initialized even when they are not used (Jeff)
- https://github.com/open-mpi/ompi/issues/5599
- Conclusion: there's not a great solution for NOT initializing the BTLs because of one-sided.
- If we did this we need an answer for what BTL to use when we get there. Right now only Portals and UCX have an answer for that.
- Also not a straightforward way for OFI to have one endpoint be shared across multiple components.
- Ralph says we DO have a way to allow the OFI MTL component to let the OFI BTL component know the other is using a certain resource. Could set a PMIx event that the other could catch.
- Either extend libfabric / libpsm2, or write a common component between ompi and opal.
- RC tomorrow
- Release Monday!
- Nathan just filed 5889 blocker.
- Discuss TCP multilink support. What is possible, what we want to do and how can we do it.
- Amazon signed up to do this, but might not happen until Q1 2019
- Discuss TCP multiple IP on the same interface. What we want to do, and how we plan to do it.
- Right now if you have both IPv4 AND IPv6, only publish the IPv6 if they have IPv6 enabled on all nodes.
- Everyone's happy with this current behavior.
- TCP BTL progress thread. Does 1 IP interface vs. >1 IP interface matter?
- Discussed this morning.
- ISSUE: Vader hit some bugs where compilers were splitting a 32-bit write into two separate 16-bit writes.
- See the Linux documentation on the evil things compilers do.
- Linux has some macros: WRITE_ONCE, READ_ONCE, ACCESS_ONCE. But the macros only work for gcc (> v4.1), llvm, and/or intelcc
- Not sure if this is just a temporary measure until we have C11-compliant compilers.
- Should we limit the number of C compilers that can be used to compile the OMPI core (e.g., limit the amount of assembly/atomic stuff we need to support)?
- E.g., PGI doesn't give us the guarantees we need
- Probably need to add some extra wrapper glue: e.g., compile OMPI core with C compiler X and use C compiler Y in mpicc.
- Are there any implications for Fortran? Probably not, but Jeff worries that there may be some assumption(s) about LDFLAGS/LIBS (and/or other things?) such that: "if it works for the C compiler, it works for the Fortran compiler".
- Decided to require compilers that can correctly guarantee WRITE_ONCE, READ_ONCE, and ACCESS_ONCE macro
- xlC does not guarantee and so can't compile Open MPI core.
- Brian and Nathan will sort out who will do this work on master for next release from master.
- Will clearly state which dirs are to be compiled with the 'core' compiler, and what a compiler needs to do to compile the 'core'.
- Essentially, configury will ask for a 'core' compiler and a 'main' compiler.
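For reference, the Linux-style macros discussed above boil down to forcing the access through a volatile-qualified lvalue. The following is a simplified sketch, not the kernel's or Open MPI's definition: the `EX_` names are hypothetical, and `__typeof__` is a GNU extension (accepted by gcc and clang), which is part of why compiler support matters here.

```c
#include <assert.h>
#include <stdint.h>

/* Access x through a volatile-qualified lvalue so the compiler emits the
 * access as written instead of splitting, merging, or eliding it. */
#define EX_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define EX_READ_ONCE(x)       (*(volatile __typeof__(x) *)&(x))

static uint32_t ex_shared_flag;

/* Writer side: one 32-bit store. */
void ex_publish(uint32_t value)
{
    EX_WRITE_ONCE(ex_shared_flag, value);
}

/* Reader side: one 32-bit load. */
uint32_t ex_observe(void)
{
    return EX_READ_ONCE(ex_shared_flag);
}
```

Note that a volatile access by itself provides no memory ordering and no atomicity guarantee from the C standard; the macros depend on the compiler emitting a single naturally aligned access, which is exactly the guarantee the notes say some compilers do not give.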
- Shall we remove the OPAL pmix framework and directly call PMIx functions?
- Require all with non-PMIx environments to provide a plugin that implements PMIx functions with their non-PMIx library
- In other words, invert the current approach that abstracted all PMIx-related interfaces.
- Why does Open MPI need to know about pmix server code?
- Only reason it's here, is because of the OPAL abstraction layer.
- Coming into the abstraction layer, because you're coming in with Opal types, and need to convert to pmix APIs.
- Howard thought we'd already decided to just call PMIx directly.
- And that this would be part of the same issue.
- In the absence of ORTE, users then have to download, build, and install PRTE, and boot it LAMBOOT style.
- LAMBOOT - could be buried under an mpirun-like wrapper.
- Engineering efforts?
- Making mpirun hide PRTE - Ralph says it's pretty trivial (he has something similar in PRTE)
- Would have a separate "project" for this, and a number of ways to do this.
- It is another piece with a different release schedule, and with different release goals. May be broader than what Open MPI needs.
- If PMIx adds a new
- New features coming down the road:
- Groups (part of sessions)
- Networking support in PMIx, but we don't have a way to take advantage of it in OMPI.
- Containers.
- About 4 or 5 things in next 6 months.
- Ralph is giving a talk at Supercomputing about users' applications using PMIx directly. And more users will want to use this.
- Is there a way for a user who linked against PMIx to intercept our MODEX?
- MODEX is special, but nothing stops a user from also getting a callback on certain PMIx events.
- If we get rid of opal wrappers that call PMIx, does that concern us for future maintainability?
- No, that was done back when we had to support multiple things, PMI1 and PMI2, and SLURM, etc.
- If we did this, we the Open MPI community would expect PMIx to behave as an interface, like we do with the MPI interface.
- PMIx has made that promise, and this is why they now have the PMIx standard.
- Following the shared library versioning and API promises, PMIx is then in the same boat as hwloc, libevent, etc.
- Why not make hwloc and libevent 1st class citizens?
- Talk about getting rid of libevent.
- Next time we do the work for libevent, we might consider moving it UP to a top-level 3rdLevel directory.
- Take a look at opal hwloc in PRTE.
- We don't have an hwloc object, it's just a name translation.
- We just call opal_hwloc...
- part we have to retain, we define binding policies in
- PRTE doesn't have embedded hwloc. All the base functions are pulled into opal_hwloc
- Ralph also has an opal event directory in PRTE.
- In PRTE there is no abstractions
- Oh wait, there IS an opal_pmix in PRTE, since there are some conversions in there.
- Howard is taking a look at this.
- Discussion expanded from just PMIx to also direct calling hwloc and libevent.
- Orte pushes events into libevent as a way to sequence them. In a very large system, if you don't do things correctly, you can see this.
- PMIx has been getting out of that habit and so everything's okay.
- hwloc not much work.
- The PMIx side would need to do translation. (We can't get rid of the error code translation; there is no way we can line up those error codes.)
- PMIx_info_ts - nice to get rid of doing these translations.
- Easy for Ralph to update, since he's already done this in PRTE.
- Difficulty - How do you deal with differences in version levels?
- Easy for compile time differences. It either builds or doesn't.
- Runtime differences in PMIx support is more difficult.
- Example: Build OMPI against PMIx v4.0, but then RUN with v3.0. The linker should fail at runtime, if .so versioning is done correctly.
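For the runtime case, an explicit check at startup is one way to fail with a clear message instead of relying only on the linker. The sketch below is illustrative: the function name, the "major versions must match" policy, and the plain "major.minor" string format are all assumptions, not PMIx behavior.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if the library found at runtime has the same major version we
 * were built against, 0 otherwise (including unparsable version strings).
 * The "same major version" policy is an assumption for illustration. */
int ex_runtime_compatible(const char *runtime_version, int built_major)
{
    int major = 0;
    int minor = 0;

    if (NULL == runtime_version ||
        2 != sscanf(runtime_version, "%d.%d", &major, &minor)) {
        return 0;
    }
    return major == built_major;
}
```

With this policy, a build against 4.x that finds a 3.x library at runtime would be rejected up front, and the caller could then report which versions were involved.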
- What do we do with SLURM PMIx (16.05 first), and ALPS PMI?
- Is there one for ALPS? No, but no reason they couldn't implement PMIx. They will support the launch mechanism at least.
- If we do this for OMPI v5.0, and you still want to run under an older SLURM, the user COULD launch PRTE_BOOT in their SLURM job.
- PROPOSAL:
- Remove ORTE
- Remove RTE layer?
- Add PRTE
- Modify mpirun / prterun - see if there's an existing PMIx server, if not auto PRTE_BOOT the PRTE daemons, and then launch.
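The proposed mpirun / prterun flow can be sketched as a small piece of control logic. Everything here is hypothetical: the `ex_rte_ops_t` callbacks stand in for whatever detection, boot, launch, and shutdown mechanisms PRTE actually provides, and the stub functions exist only to exercise the flow.

```c
#include <assert.h>

/* Hypothetical hooks; real detection/boot/launch would call into PRTE/PMIx. */
typedef struct {
    int (*server_present)(void);    /* nonzero if a PMIx server already exists */
    int (*boot_daemons)(void);      /* 0 on success */
    int (*launch_job)(void);        /* 0 on success */
    int (*shutdown_daemons)(void);  /* 0 on success */
} ex_rte_ops_t;

/* Boot daemons only if no server is found, launch, and tear down only
 * what this invocation booted. */
int ex_mpirun(const ex_rte_ops_t *ops)
{
    int booted_here = 0;

    if (!ops->server_present()) {
        if (0 != ops->boot_daemons()) {
            return -1;
        }
        booted_here = 1;
    }
    int rc = ops->launch_job();
    if (booted_here) {
        ops->shutdown_daemons();
    }
    return rc;
}

/* Stub hooks that just count calls. */
static int ex_boots, ex_launches, ex_shutdowns;
static int ex_no_server(void)  { return 0; }
static int ex_has_server(void) { return 1; }
static int ex_boot(void)       { ex_boots++; return 0; }
static int ex_launch(void)     { ex_launches++; return 0; }
static int ex_shutdown(void)   { ex_shutdowns++; return 0; }
```

When a server is already present (for example, under a resource manager that provides one), the wrapper skips both the boot and the shutdown and only launches.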
- PRTE would HAVE to go to formal release
- Any components in ORTE not in PRTE.
- QUESTION: could mpirun act like a wrapper around srun?
- No, but we could look at that.
- Does SLURM support spawn?
- If it did, could just call PMIx_Spawn, and let slurm launch the PMIx daemon.
- This will destroy years of PowerPoint slides.
- OMPI layer will just call PMIx.
- Not THAT much work, because we don't have to change the ORTE side.
- Howard is going to remove bfo component (removes a trouble spot)
- Howard (ECP) in January timeframe.
- Are we going to redistribute PRTE?
- Yes, It's got same license as Open MPI.
- Ralph moves hwloc and libevent up.
- Some things that need translation
- But Don't do name shifting.
- Want to see where the glue goes.
- It will still check for external first, and then build the internal pieces that don't exist.
-
Should we publish Open MPI release tarballs to https://github.com/open-mpi/ompi/releases?
- Per https://github.com/open-mpi/ompi/issues/5604, I posted a bunch of "Official releases aren't here..." on the github releases page.
- We all agreed. This is exactly the right amount of no work.
=====
- Ompi will call PMIx directly without opal wrappers.
- Non-PMIx approaches will need to provide PMIx translations
- For example PMI or PMI2
- rip out ORTE
- use released PRTE
- mpirun would wrap prte_boot: boot it up, launch, and shut down.
- No name space shifting.
- do need our own errors.
- Intended for Open MPI v5.0 1st half of 2019
- https://pmix.org/support/faq/how-does-pmix-work-with-containers/ addresses (some of) PMIx-to-PMIx compatibility issues.
- But what about OMPI to PMIx compatibility?
- And what about RM to PMIx compatibility?
- How do we convey what this multi-dimensional variable space means to users in terms of delivered MPI features?
- Case in point: https://github.com/open-mpi/ompi/issues/5260#issuecomment-421407400 (OMPI v3.0.x used with external PMIx 1.2.5, which resulted in some OMPI features not working).
- At least three dimensions:
- What does OMPI support and do?
- What does PMIx client support and do?
- What does PMIx server support and do?
- Giant rat-hole of issues.
- OMPI can choose which versions of PMIx OMPI will support.
- Would like to know WHAT failed, and WHY it failed in that mode.
- Not just SLURM failed MPI_Comm_spawn.
- But SLURM failed BECAUSE it was configured like.
- Probably easier to do this with error gathering at the end, rather than at the beginning
- Does PMIx have "WHO" is resource provider info?
- Artem (Ralph proxy) not now, but could add key/value, without change in standard.
- This would be optional, so it wouldn't change the standard.
- DONT MAKE THIS OPTIONAL! This is why things are so hard!
- Mellanox has this problem also (not sure what names of some keys are)
- Places in PMIx standard which are obscure in some way.
- Static capabilities we know at the beginning of time; we can record them and reference them when we fail.
- Dynamic capabilities are harder.
- Error reporting to the user is IMPORTANT.
- Scientists need error messages that tell them what to do, and who to talk to.
- Jeff showed Josh's Opal_SOS - returned an error code at each level.
- Lower level thought this error is critical, but higher level didn't think it was a critical error, because it can try other approaches.
- Questions for PMIx
- Error runtime for pmix
- What about a tool that queries capability
- What PMIx functionality won't work? May not be that much.
- MPI_Comm_spawn - Spawn
- MPI_Connect
- Events - PMIx doesn't support this.
- Need to find a use case, and drive that home.
- Requirement is to identify where the error came from, and any additional description along the way, all the way back up to MPI.
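One way to picture the requirement above: each layer appends its own code and description as the error travels upward, and the top level prints the whole chain. This is a sketch with hypothetical names and messages; it is not Opal_SOS and not a PMIx API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define EX_MAX_FRAMES 8

typedef struct {
    const char *layer;   /* who reported it, e.g. "PMIx server" */
    int         code;    /* that layer's own error code */
    const char *why;     /* human-readable description */
} ex_err_frame_t;

typedef struct {
    ex_err_frame_t frames[EX_MAX_FRAMES];
    int depth;
} ex_err_chain_t;

/* Called by each layer as the error propagates upward. Returns 0 on success,
 * -1 if the chain is full. */
int ex_err_push(ex_err_chain_t *chain, const char *layer, int code, const char *why)
{
    if (chain->depth >= EX_MAX_FRAMES) {
        return -1;
    }
    chain->frames[chain->depth++] = (ex_err_frame_t){ layer, code, why };
    return 0;
}

/* Render outermost layer first, one line per layer. Output is always
 * NUL-terminated when outlen > 0; returns the number of bytes written. */
size_t ex_err_format(const ex_err_chain_t *chain, char *out, size_t outlen)
{
    size_t used = 0;

    if (outlen > 0) {
        out[0] = '\0';
    }
    for (int i = chain->depth - 1; i >= 0 && used < outlen; i--) {
        const ex_err_frame_t *f = &chain->frames[i];
        size_t room = outlen - used;
        int n = snprintf(out + used, room, "%s: error %d: %s\n",
                         f->layer, f->code, f->why);
        if (n < 0) {
            break;
        }
        used += ((size_t)n < room) ? (size_t)n : room - 1;
    }
    return used;
}
```

A higher layer that decides the failure is not critical can simply drop the chain and try another approach, which matches the Opal_SOS observation above.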
- originally ompi-tests repo was private because we didn't know where they came from (and if we have redistribution rights)
- Move the known safe things to a new repo that is public.
- Edgar developed some tests that are clearly ours.
- Let's make a new public repo called 'tests' under 'open-mpi' group.
- Edgar had one project that could be released
- Latency test suite
- ompi2 test suite
- Needs a LICENSE
- May want a release cycles.
- may want two repos (to be checked out separately). Edgar will think about it.
- Thomas N has been on Oak Ridge National Lab ECP (Exascale Computing Project)
- Scope and objectives:
- want to minimize memory usage per rank.
- using TAU for these.
- Small patch to easily instrument OPAL for memory tracking.
- opal_object - and a few spots get variable name, total size, and count.
- substructures - like proc_t and how much of that is from the group_t.
- Example profiling run:
mpirun -np 2 tau_exec -T mpi.pdt ./a.out
- Fujitsu MPI for Post-K Computer
- Development Status in Fujitsu
- QA Activity in Fujitsu
- Open MPI really NEEDS vendors to check for new issues.
- SCHEDULE: Aiming for next summer (2019)
- ABI-changing commit on master (after v4.0.x branch) which will affect future v5.0.x branch: https://github.com/open-mpi/ompi/commit/11ab621555876e3f116d65f954a6fe184ff9d522.
- Do we want to keep it? It's a minor update / could easily be deferred.
- Keep it on master, since v5.0.x will be next master release mid-late 2019
- George will cleanup checkpoint restart
- C++ gone
- PRTE change
- Will NOT remove --enable-mpi1-compat or the TKR version of the "use mpi" module.
- WILL put the TKR version of the "use mpi" module under some new --enable-SOMETHING configury flag. Geoff Paulsen will get with Jeff Squyres and do this
- Remove RoCE and iWARP support
- Does UCX still work on older things? - Yes new emulation handles it except for very minor differences.
- We didn't do this in Open MPI v4.0.0 because libfabric v1.7 was AFTER Open MPI v4.0.0
- We WILL remove openib btl for v5.0
- We do NOT need an iWARP (or ROCE) btl, UCX or Libfabric are fine.
- So, This all means no v4.1.x
- OMPIO atomicity levels
- OMPIO external32
- Review of v3 and v4 features
- v3 has shipped, in OMPI v4.0.0
- allows the scheduler/launcher to request a payload for PMIx that includes whatever info you want.
- For example will pickup env variables in plugins and forward along.
- Can ask it to assign networking resources.
- PMIx puts all of this into the payload. The launch message carries it along to the PMIx daemons on the remote nodes, which know how to pull out their pieces.
- All preliminary work to take advantage of this is in Open MPI.
- pnet framework has skeleton to use this.
- Tool support for IO forwarding in v3
- v4 - Howard
- v4 includes the network topology graph, and provides a graph. Right now, only doing it for Omnipath.
- Two ways to get this data:
- pmixd queries hwloc, pushes it back up to the head node, which wires it up and sends it back down ('roll up' method) - doesn't get switches
- believe this approach only gets physical not virtual
- Query the subnet manager.
- Still talking about pointers to data, rather than copies of data. not sure if this will make PMIx v4
- Outline changes in OMPI required to support them
- If we follow up where PMIx symbols are exposed
- Outline changes for minimizing footprint (modex pointers instead of copies)
- Decide which features OMPI wants to use
- Are there potential problems with users making pmix calls on same namespace as MPI?
- Only thing we could think of is if user's app compiled against a different version of pmix.
- Released the formal v2.0 standard.
- Have a v3.0 draft standard sometime end of month.
- How to orchestrate it?
- PRTE doesn't support MPIR.
- Publicly said:
- deprecation warning in mid-late 2018
- remove MPIR in mid-late 2019 (lines up nicely with move to PRTE change)
- If someone sets MPIR, can print a message to SHOW HELP
- mpirun has a function that checks the "being debugged" flag MPIR_being_debugged (defaults to 0)
- Ralph will put a show-help message, Jeff will edit PR
- Will get this Deprecation warning into v4.0.0 in next few days.
- Something happened in August where the building was struck by lightning in the basement
- No redundancy or power strike. Something got fried, and it took up to 7 days.
- OMPI core
- OMPI devel/packagers
- OMPI users/announce
- OMPI commits
- HWLOC commits
- HWLOC devel/users/announce
- MTT users/devel
- Do we want to move to a commercial hosting site? Would cost about $21/month, or about $250/year
- Additionally: The Mail Archive is going through some changes. Unclear yet as to whether this will impact us or not
- https://www.mail-archive.com/[email protected]/msg01586.html
- We imagine some Open Source friendly solution would also avail itself.
- Ralph donated 3 years of HostGator (expires July of next year).
- When this runs out, move website to AWS.
- also hosting @openmpi.org emails for some developers.
- As of 4 Sep 2018, Open MPI has $575 in our account at SPI.
- How do we detect if mail server is down?
- Could have an hourly cron that checks if latest commit is the same as the last commit message.
- Why don't we just move to AWS?
- It's non-zero work. Let's just give mailman again, also $21/month is CHEAP
- This sounds good to pay $250/year
- Jeff created open-mpi.slack.com workspace and invited us.
=====
- ORTE support model
- See https://docs.google.com/document/d/1VwqUVAhkeJt7PmaQCBQpUXBYdy9Elg5m-TBzY-6lLQY/edit
- Should we remove ORTE from the Open MPI repo and make it a separate project?
- How do we resolve the "one package" philosophy we have embraced from day one?
- Nathan/Brian: Vader bug cleanups
- Want to strengthen the recent vader fixes to be fully bulletproof
- openib: persistent error reported by multiple users
- https://github.com/open-mpi/ompi/issues/5914
- Feels like this should be easy to fix...?