rhc54 edited this page Feb 24, 2016 · 121 revisions

Feb 2016 OMPI Developer's Meeting

Dates:

  • Start: 9am Tuesday, February 23, 2016
  • Finish: 1pm Thursday, February 25, 2016

Location:

IBM Dallas Innovation Center

  • Google Maps
  • Street address: 1177 South Beltline Road, Coppell, Texas 75019 USA
  • Enter on the East entrance (closest to Beltline Road)
    • Hollerith Room - On left after you walk in. (Now all 3 days, in the same room!)
    • Receptionist should have nametags for everyone.
    • Foreign Nationals welcome.
    • No need to escort visitors in this area.

Map of Hotels:

  • These 3 Blue hotels offer shuttles both to/from IBM site AND the DFW International Airport:
    • Sheraton Grand Hotel (972-929-8400) - 4440 West John W Carpenter Freeway, Irving, TX 75063, USA
    • Holiday Inn Express (972-929-4499) - 4550 West John W Carpenter Freeway, Irving, TX 75063, USA
    • Hampton Inn (972-471-5000) - 1750 TX-121, Grapevine, TX 76051, USA
  • https://www.mapcustomizer.com/map/IBM%20IIC%20and%20Hotels%20-%20Map2?utm_source=share&utm_medium=email
  • NOTE: We stopped calling after we found 3 hotels that offered shuttles BOTH to the IBM site, and to the airport.

Attendees

Attending in person:

  1. Jeff Squyres - Cisco
  2. Ralph Castain - Intel
  3. Edgar Gabriel - UHouston
  4. Stan Graves - IBM
  5. Geoff Paulsen - IBM
  6. Perry Schmidt (in and out) - IBM
  7. Dave Solt - IBM
  8. Brice Goglin - Inria
  9. Howard Pritchard - LANL
  10. George Bosilca - UTK
  11. Takahiro Kawashima - Fujitsu
  12. Shinji Sumimoto - Fujitsu
  13. Annu Dasari - Intel
  14. Nysal Jan K A - IBM
  15. Sylvain Jeaugey - NVIDIA
  16. Artem Polyakov - Mellanox
  17. Sameh Sharkawi - IBM
  18. Nathan Hjelm - LANL

Attending Remotely:

  1. Josh Hursey - IBM (Added a ☎️ icon next to the items I'd like to call in for, if possible)

Suggested topics:

  1. ☎️ Geoff Paulsen: fast path for send when only 1 outstanding request (a la Platform MPI)
  2. ☎️ Ralph/George/Jeff/Edgar: establishing dependencies in frameworks. This is a somewhat larger topic, but for the time being we will focus on a simple usage scenario...
    • In December, Ralph/George/Jeff talked about Intel's desire to use some of OPAL/ORTE/OMPI's frameworks in its own projects. Ralph had previously been copying source code around between repositories, and it was pretty much a mess.
    • After much discussion, we realized that Ralph can just link projects like SCON against libopen-rte.so and just access the frameworks that he wants. I.e., do a full install of ORTE (or probably OMPI?) using --with-devel-headers, and then literally treat OMPI/ORTE/OPAL frameworks just like any other shared library. Yay!
    • Sidenote: just for cleanliness, Ralph will "tighten up" each ORTE framework to require as few dependencies as possible
    • There is one caveat, however: the application linking against libopen-rte likely doesn't want to call orte_init() to initialize all of ORTE; it just wants to use a few frameworks. So the application will need to know the "magic order" in which frameworks must be initialized (i.e., the contents and ordering of orte_init() and the ESS framework init [which is where most of the heavy lifting for ORTE initialization actually occurs these days]).
    • What would be better is if the frameworks themselves could declare what their dependencies are. E.g., if the application initializes the ABC framework, the ABC framework should be able to realize that it requires the DEF framework to be initialized first. This could possibly be done with a registration-based system. E.g., in the first few lines of framework open and/or init, it can call something like opal_register_framework_dependency("ABC", "DEF"). Components could even do the same thing; for example, the ob1 PML needs the BTL framework, so it could call opal_register_component_dependency("pml", "ob1", "btl"), or something like this.
    • At this meeting, we'd like to discuss the possibilities for such a system, and sketch out what the API should be.
    • Such a system could actually be used throughout all of OPAL, ORTE, OMPI, OSHMEM. It would not only eliminate the "magic ordering" that we have in opal_init(), orte_init(), ompi_runtime_init(), and oshmem_runtime_init(), but also allow for minimal initialization in cases where not all components are necessary for a particular run.
    • MPI_Comm_Info ramifications...
    • NOTE: This could also be useful for the "Additional tests to check for inadvertent dependencies between components / frameworks" bullet, much further down in this list.
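A minimal sketch of how such a registration-based system might look. The registration function name follows the proposal above; the storage and resolution logic (and the grpcomm/rml/oob example names used in passing) are purely illustrative, with no cycle detection or de-duplication:

```c
/* Illustrative sketch only: a registry of framework dependencies.
 * The registration call follows the name proposed in the notes;
 * nothing here is actual OPAL code. */
#include <string.h>

#define MAX_DEPS 64

struct dep { const char *framework; const char *depends_on; };
static struct dep dep_table[MAX_DEPS];
static int ndeps = 0;

/* E.g., the ABC framework calls this in its open function to say
 * it needs DEF initialized first. Returns 0 on success. */
static int opal_register_framework_dependency(const char *framework,
                                              const char *depends_on)
{
    if (ndeps >= MAX_DEPS) return -1;
    dep_table[ndeps].framework = framework;
    dep_table[ndeps].depends_on = depends_on;
    ndeps++;
    return 0;
}

/* Compute the init order for 'framework': registered dependencies
 * first, then the framework itself. Appends names to 'order' and
 * returns the new count. */
static int resolve_order(const char *framework, const char **order, int count)
{
    for (int i = 0; i < ndeps; i++) {
        if (0 == strcmp(dep_table[i].framework, framework)) {
            count = resolve_order(dep_table[i].depends_on, order, count);
        }
    }
    order[count++] = framework;
    return count;
}
```

A real implementation would also need cycle detection and de-duplication (a framework reachable through two dependency paths should still only be initialized once).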
  3. Jeff: What should be our timeframe for forking for v3.0.0?
  4. Discuss some additional tests
    • Can we have some tests to check to see if there are components that are dependent upon other components (and should not be)?
    • @edgargabriel has some issues in OMPIO that he'd like to do better: many of the OMPIO sub-frameworks (e.g., fcoll, fbtl, etc.) require the functionality of the ompio component in the io framework. How should he do this?
    • 'nm' test for each framework?
    • Singularity test(s)
  5. ☎️ Thread multiple support - where are we on this?
  6. Async progress - status? What still needs to be done?
  7. Memory Pool/Registration Cache rewrite
  8. George: request overhaul?

RESULTS

  1. Can we double check that all vendor Open MPI distributions are appropriately marked? E.g., via --with-ident-string? Not sure how to do this other than to ask each vendor/distribution -- perhaps we should default to "Unknown distribution" and have the nightly/release scripts set the official strings...?
    • IBM - working on it
    • Cisco - verified has ident-string
    • Fujitsu - modifies code to include Fujitsu-specific text
    • Mellanox - working
    • Bull - Sylvain will ping someone to check
    • Intel - not sure if they distribute, but Ralph will check
    • We will change the master repo to say "Unknown distribution" and modify the nightly/release scripts to set it to "Open MPI Community Distribution"
  2. Nathan: Do we really need --enable-mpi-thread-multiple any more?
    • None of the BTLs can enable thread support if we configure it off
    • Becoming of greater interest due to increased cores/die
    • Some performance hit on shared memory (~40 nanoseconds) due to checking opal_using_threads()
    • Perhaps enable by default, but retain the ability to disable for those not wanting to accept the hit?
    • Nathan to create PR with the change, modify opal_using_threads to add OPAL_UNLIKELY to help reduce the penalty, will flag Brian to ask if/why this is a bad idea
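A sketch of what that change might look like. OPAL_UNLIKELY exists in the code base as a wrapper around __builtin_expect; the counter code around it here is invented purely to illustrate the single-threaded fast path:

```c
/* Sketch of guarding the thread-safety branch with a branch-
 * prediction hint, per the discussion above. Only OPAL_UNLIKELY
 * and opal_using_threads() correspond to real OPAL names; the
 * rest is illustrative. */
#include <stdbool.h>

#define OPAL_UNLIKELY(x) __builtin_expect(!!(x), 0)

static bool opal_uses_threads = false;

static inline bool opal_using_threads(void) { return opal_uses_threads; }

static long counter = 0;

/* Hot path: single-threaded runs skip the atomic entirely, and the
 * hint tells the compiler to lay out the non-atomic branch first. */
static void counter_increment(void)
{
    if (OPAL_UNLIKELY(opal_using_threads())) {
        __atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
    } else {
        counter++;
    }
}
```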
  3. ☎️ Jeff: Git etiquette: should we start doing "git standard" first-lines in commit messages?
    • There are some markdown techniques that make searches easier (git shortlog and git graph)
    • Request that people follow the proper format
      • First line: area affected by the commit followed by a colon, followed by short one-line description
      • Blank line
      • Longer description
    • Here's a good description of good git commit messages
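As a concrete illustration of that format (the component name and description below are made up):

```
fcoll/dynamic: fix segment-count overflow on large writes

A longer description follows after the blank line, wrapped at
roughly 72 columns, explaining what changed and why.
```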
  4. Ralph: Further cleanup of the code base for project separation
    • Split autogen.pl and configure.ac by project?
    • Cleanup naming conventions - still have "ompi" in the "opal" layer, "ompi" named configure variables in the opal layer, etc.
    • Go ahead, gradual series of PR's to split the projects
    • General request that people cleanup use of OMPI names in OPAL
  5. Discussion that came up on the user list about how to help users ensure that they build Open MPI the way that they think they have built it (e.g., did they really build TM support?): http://www.open-mpi.org/community/lists/devel/2016/01/18497.php
    • How to have a solution that is "easy enough" for average users, but powerful enough to catch common cases (e.g., where feature X's headers/libs are in a non-default location, and user assumes that OMPI found them anyway/included support).
    • Would be sooooo nice if we could do something other than Bourne shell programming! :frown: (i.e., for each framework / component to save itself in a data structure/hash, and then prettyprint that data structure at the end)
    • What if we put only "critical" things in a summarized list at the end of configure? Problem is: how do we decide which is and isn't included?
    • Nathan will add a .m4 macro to add your functionality to the summary to be printed at the end of configure. We will self-police to keep this from becoming overwhelming
    • Issue filed: Modify ompi_info --all to include output of the framework/component table
    • Maybe someday when someone has time: some kind of "make menu" that emits a platform file?
    • Geoff Paulsen will generate an example platform file that contains all the options, manually edited to break it into sections (e.g., RTE's, networks)
  6. Jeff+Ralph: --host and --hostfile behavior (Ralph's favorite topic!).
    • The PR https://github.com/open-mpi/ompi/pull/1353 seems to have gone too far
    • Can we avoid a flip-flop between 1.10.x and 2.x?
    • Geoff - Interest in --hostlist behavior from Platform-MPI? Looks something like: --hostlist "mpi0[1-4,6]:2"
    • Another much more complex example: --hostlist "mpi[01-10:2,13-11:4,16]:2,clustA[01-04]:4"
    • If num procs is not specified, we will exit with error
    • Add a new option for "fill-me-up", one per "slot"
    • If the user cmd line requires topological detail, then we will look to the hostfile and resource manager (via PMIx) first. If we can't get it otherwise, then launch the DVM to get it. Otherwise, we will not launch the DVM as we don't need the topological detail.
    • Geoff will create a new host parsing framework with these three new components. It will be multi-select, with each component looking at the orte_job_t to decide if an option was provided that they should parse. Output will be the list of node objects
    • hostlist will require use of "sequential" mapper. Sequential mapper will not require -np to be provided.
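A hedged sketch of expanding the simple form of that syntax. It handles only "prefix[list]:slots" as in the first example above (no zero-padding or per-range slot counts from the more complex example), and every parsing choice here is an assumption, not Platform MPI's actual grammar:

```c
/* Illustrative expansion of a simple hostlist spec such as
 * "mpi0[1-4,6]:2" -> mpi01..mpi04, mpi06, each with 2 slots. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct hostspec { char name[64]; int slots; };

/* Writes up to 'max' entries into 'out'; returns the count,
 * or -1 on a parse error. */
static int expand_hostlist(const char *spec, struct hostspec *out, int max)
{
    char prefix[32];
    const char *open = strchr(spec, '[');
    const char *close = strchr(spec, ']');
    if (!open || !close || close < open) return -1;
    size_t plen = (size_t)(open - spec);
    if (plen >= sizeof(prefix)) return -1;
    memcpy(prefix, spec, plen);
    prefix[plen] = '\0';

    int slots = 1;                       /* default if no ":n" suffix */
    if (close[1] == ':') slots = atoi(close + 2);

    int count = 0;
    const char *p = open + 1;
    while (p < close) {
        char *end;
        long lo = strtol(p, &end, 10);
        long hi = lo;
        if (*end == '-') hi = strtol(end + 1, &end, 10);
        for (long i = lo; i <= hi && count < max; i++) {
            snprintf(out[count].name, sizeof(out[count].name),
                     "%s%ld", prefix, i);
            out[count].slots = slots;
            count++;
        }
        if (*end == ',') p = end + 1; else break;
    }
    return count;
}
```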
  7. Jeff: Do we want to create some templates for Github issues / pull requests?
    • Per https://github.com/blog/2111-issue-and-pull-request-templates
    • If nothing else, perhaps an issue template for ompi-release that says "DON'T FILE ISSUES AGAINST THIS REPO"
    • Howard will create a template for ompi-release that tells the user not to file issues there
    • Add bot to create issues to move PRs against master to identified branches when PR is merged into master, auto-assign issues to PR author
    • Perhaps use labels in place of or in addition to issues? Implementer can decide!
  8. Ralph: Don't try to bind tasks when direct launched by an external launcher that did no binding?
    • See https://github.com/open-mpi/ompi/pull/1386
    • use-case of concern in cited ticket was OpenMP. We have hit this issue of binding on OpenMP applications before. With the schizo framework, and the ability to specify multiple personalities, should we add a personality option for OpenMP that turns off binding?
    • if user explicitly states "no binding" to the resource manager on a direct launch cmd line, should we disable our auto-binding?
    • Leave the case where nothing is specified on direct launch alone - we will still bind by default in that case. Users can use the MCA param to request "do-not-bind", and admins can set that in the default MCA param file if they want
    • Look for envars to indicate "bind-to-none" was given to the RM on direct launch, and don't bind if so
    • Leave rest alone
  9. Howard: let's start discussing the features we want for v2.1.0.
    • OpenMP/Open MPI interop - might be able to help support this via schizo framework
      • Howard is experimenting with this now - sounds like schizo should provide the necessary hooks
    • ...add your own favorite features here...
      • Singularity container support
      • PLEASE ADD FEATURES HERE
  10. George: Per #1308, there's an ambiguity between the info key value that a user/application sets with MPI_COMM_INFO_SET and the value that is actually propagated to a child communicator via MPI_COMM_DUP_WITH_INFO: does it use the value that the user set, or the value that OMPI decided to use?
  11. Geoff: mpirun --entry method for wrapping multiple PMPI profiling libraries at runtime via fancy function pointer code.
  12. Geoff: mpirun --aff yet another way to specify bindings. From Platform MPI, very similar to existing method at backend (parsing the strings obtained from hwloc), but exposes different syntax via mpirun -aff ____. Notes: https://gist.github.com/markalle/52fb42c701267c49b450
  13. Geoff: mpirun --prot - print pt2pt connection methods used for process to process communication.
    • Ralph pointed out that some mapping of Platform-to-OMPI options could be accommodated via schizo framework. Will work with Geoff to see what makes sense and/or needs to be changed to make that feasible
    • New options will be added by IBM via PR
  14. ☎️ Artem: Mellanox/LANL have raised a good point that customers do not want to have multiple different Open MPI installations in their environments (e.g., Vendor A OMPI and Vendor B OMPI and community OMPI).
    1. How can a customer have a single OMPI installation, but still have vendor/distribution-specific enhancements?
    2. Or is that the wrong question -- should we really be working to enable individual component distribution? This would disallow vendors from distributing patches to core.
    • We don't take the patches in our tarball, but we did add the infrastructure so users can download patches and then use this option to apply them on-site
    • Mellanox is satisfied with this approach, so nothing needs to be done.
  15. Jeff: Per https://github.com/open-mpi/ompi/issues/1379, it looks like MPI processes are "becoming invisible" to the resource manager (SLURM, in Cisco's case). Ralph and Jeff are pretty sure that at some point in the past, we set the orted to create its own process group and launch all MPI processes in that. This means that if/when the RM tries to kill the process group that it launches, it won't kill any of the MPI processes (because they opted out / created their own process group).
    • Ralph thinks that this may have been done so that we could deliver a signal to the entire MPI process group and not signal the orted. He remembers that this was a Sun-asked-for feature.
    • We might not want that behavior by default any more -- for the reasons cited on https://github.com/open-mpi/ompi/issues/1379 (i.e., that old stale processes can get left around and not killed by the resource manager).
    • Let's check the code and decide if we want to revisit this decision of creating a process group by default.
    • How is this affected when run under cgroups?
    • Ralph will modify the orted to not put the application procs in a separate process group
    • Please see https://www.open-mpi.org/community/lists/devel/2016/02/18612.php for explanation of the issue
    • IBM suggests reuse of Platform approach: have orted hit itself with SIGTSTP and trap it, signal will propagate to all children since they remain in orted's process group, orted ignores signal to remain awake
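The process-group mechanics under discussion can be sketched as follows. The helper below is illustrative, not actual orted code; it just demonstrates that a child detached via setpgid(2) lands in a process group the parent's RM-delivered group signal will miss:

```c
/* Illustrative only: show the effect of putting a forked child in
 * its own process group (the behavior the notes propose removing). */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child; if own_group is nonzero, detach it into its own
 * process group. Returns the child's process group id as observed
 * by the parent. */
static pid_t spawn_and_get_pgid(int own_group)
{
    pid_t pid = fork();
    if (pid == 0) {
        if (own_group) setpgid(0, 0);   /* child detaches itself */
        _exit(0);
    }
    /* Parent sets it too, so the result is race-free either way. */
    if (own_group) setpgid(pid, pid);
    pid_t pgid = getpgid(pid);
    int status;
    waitpid(pid, &status, 0);
    return pgid;
}
```

With own_group set, a signal sent to the parent's process group (as an RM would send it) never reaches the child, which is exactly the stale-process symptom described in the ticket.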
  16. Jeff+George: discuss installation of monitoring test (https://github.com/open-mpi/ompi/pull/1390)
    • Resolved: George to move the code to ompi/contrib and file a PR to move it to v2.1.0.
  17. ☎️ Ralph: PMIx
    • Fault response APIs - definition, requests
    • Information request APIs - definition, requests
    • If/how we integrate the above into OMPI
      • Expose desired APIs via MPI extension, support multiple fault response modes (e.g., ULFM, run thru, etc)
    • Memory and file descriptor leaks in the server
    • Remove the barrier at the end of MPI_Init?
      • Will make this an option, default to barrier, controlled by MCA parameter
  18. MTT status
    • Ralph and Josh described current status
  19. Artem: Shipping of UCX with OMPI sources.
    • No issue so long as the build system is set up to allow internal and external library support, as upstream distros will not allow the embedded library
  20. Jeff: Now that high-speed networks can be accessed via multiple network stacks (and multiple Open MPI components), users are getting confused about how to (un)select specific networks. We need to figure out a better / easier way for users to convey what they want. See http://www.open-mpi.org/community/lists/devel/2015/10/18154.php for more detail.
    1. One (very loose and probably not yet well thought-out) idea is to have some kind of higher-level abstraction: --[enable|disable] NETWORK_TYPE:QUALIFIER
    2. Related question for MPI_T: we have tool developers asking if we can report the following:
      1. Network interface(s) being used
      2. Some kind of descriptive name of the network interface(s) being used for each interface
      3. Transport protocol being used on each interface
    3. Add command line option (e.g., --net) so user can specify "disable:netname" to direct that all components connected to that "netname" shall disqualify themselves
    4. Also exposed as an MCA parameter
    5. Components will register one or more "short name" identifiers that can be checked against those given by user
    6. Short names will be exposed via ompi_info
    7. Hide the "self" BTL as it is not optional anyway
    8. Still many issues (e.g., does "disable TCP" disable the RTE's TCP components as well?), but we will take baby steps first and deal solely with the MPI layer
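One way the short-name registration and "disable:netname" disqualification described above might be sketched; the struct layout and directive syntax here are assumptions, not an agreed design:

```c
/* Illustrative sketch: each component registers short network
 * names, and a user directive like "disable:usnic" disqualifies
 * any component that registered that name. */
#include <stdbool.h>
#include <string.h>

struct component {
    const char *name;
    const char *shortnames[4];   /* NULL-terminated list */
};

/* Returns true if 'comp' must disqualify itself for 'directive'. */
static bool component_disqualified(const struct component *comp,
                                   const char *directive)
{
    if (0 != strncmp(directive, "disable:", 8)) return false;
    const char *net = directive + 8;
    for (int i = 0; comp->shortnames[i] != NULL; i++) {
        if (0 == strcmp(comp->shortnames[i], net)) return true;
    }
    return false;
}
```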
  21. ☎️ Fujitsu: Collaboration of Fujitsu and Open MPI Community
    • Source Code Contribution
    • Collaborative Development
    • Contribution for Quality
  22. ☎️ Josh: Review MTT database schema
    • There are evolving requirements on this schema - e.g., correlation to external data, addition of inventory, referencing the actual .ini file. Let's see if a more flexible schema can be devised that can accommodate the broader set of requirements
    • Yay!
  23. Renaming the component DLL's using the project-level name - i.e., change mca_ess_tm.la to orte_ess_tm.la. This would remove the current restriction against having the same framework name in two different projects, which is becoming more of an issue as ORTE and OPAL are reused.
    • Ralph will publish a proposal - we will rename libraries to .la
  24. George+Nathan: Alternative mechanisms for tagging sentinel ompi_proc_t locations that preserve 32-bit support and do not depend on the size of the opal_process_name_t. See https://github.com/open-mpi/ompi/pull/1345
    • This may have been solved already...?
    • George and Nathan have agreed
  25. Jeff+Ralph: Re-discuss the separation of our libraries into libmpi, libopen-rte, and libopen-pal.
    • Beginning of the project: there was just libmpi. Later, it was split into projects, and then the project libraries. Later, the build was unified back into libmpi again.
    • In Dec 2012 (here's the commit), we split the build back into 3 libraries. The commit message cites discussion at the Dec 2012 Open MPI dev meeting -- but there are unfortunately no clues as to the rationale why this was done in the wiki notes. Was it just because we developers like having 3 smaller libraries? Or is there some deeper technical issue? Neither Ralph nor Jeff remembers. 😦
    • Rationale for bringing this up again: when upstream projects are trying to link against portions of our project, and then also support apps that link against all of it, we run into conflicts (e.g., the ORTE being used by the upstream project may be different than the one being used by the OMPI installation). Slurping it all up into one library -- and making only the MPI API be visible -- would resolve the problem. ...but we cannot recall if there are undesirable side-effects.
    • No issue with recombining libraries, need to look closer at whether ORTE/OPAL symbols still need to be public
  26. Ralph+George: Routing framework
    • What does this morph into as we move to OFI/BTLs under RML?
    • What happens to RML resilience, which currently flows through the routed framework?
    • Deferred to offline
  27. Jeff+Ralph: TCP OOB component currently takes all IP addresses from all peers and tries to connect to them in order. If it fails to connect, it will timeout and move on and try the next IP address to the peer -- but it can stall the job for 30-60 seconds (during MPI_INIT) with no real output/feedback to the user, unless oob_base_verbose is high.
    • Can we make this better?
    • E.g., use usnic-btl-like network graph solving to figure out which local interface should be used to communicate with which remote interface?
    • Or simultaneously open multiple non-blocking connect(2)ions and see which/if any succeed?
    • Or ...?
    • Deferred to offline
  28. Jeff+Ralph: What to do about conflicting OPAL version numbers?
    • Jeff has an old note about this -- something about conflicting version numbers between OPAL and ORCM...? I don't remember the exact context.
    • The usNIC BTL currently uses the OPAL version number to determine what to do w.r.t. compatibility between the v1.10, master, and v2.x trees.
    • Deferred to offline
  29. Geoff: Licensing framework - compiles out by default. Provide a component and it will SLURP it into library. Hooks in MPI_INIT, MPI_FINALIZE, all of the local calls allowed before MPI_INIT.
    • Jeff asks: do you need MPI_T_INIT and MPI_T_FINALIZE, too? What about orte_init and/or opal_init[_*]?
    • Generic hook name, static framework, in opal layer