Meeting 2016 02
Artem Polyakov edited this page Feb 24, 2016
Dates:
- Start: 9am Tuesday, February 23, 2016
- Finish: 1pm Thursday, February 25, 2016
Location:
IBM Dallas Innovation Center web site
- Google Maps
- Street address: 1177 South Beltline Road, Coppell, Texas 75019 USA
- Enter on the East entrance (closest to Beltline Road)
- Hollerith Room - On left after you walk in. (Now All 3 days, in the same room!)
- Receptionist should have nametags for everyone.
- Foreign Nationals welcome.
- No need to escort visitors in this area.
Map of Hotels:
- These 3 Blue hotels offer shuttles both to/from IBM site AND the DFW International Airport:
- Sheraton Grand Hotel (972-929-8400) - 4440 West John W Carpenter Freeway, Irving, TX 75063, USA
- Holiday Inn Express (972-929-4499) - 4550 West John W Carpenter Freeway, Irving, TX 75063, USA
- Hampton Inn (972-471-5000) - 1750 TX-121, Grapevine, TX 76051, USA
- https://www.mapcustomizer.com/map/IBM%20IIC%20and%20Hotels%20-%20Map2?utm_source=share&utm_medium=email
- NOTE: We stopped calling after we found 3 hotels that offered shuttles BOTH to the IBM site, and to the airport.
Attendees:
- Jeff Squyres - Cisco
- Ralph Castain - Intel
- Edgar Gabriel - UHouston
- Stan Graves - IBM
- Geoff Paulsen - IBM
- Perry Schmidt (in and out) - IBM
- Dave Solt - IBM
- Brice Goglin - Inria
- Howard Pritchard - LANL
- George Bosilca - UTK
- Takahiro Kawashima - Fujitsu
- Shinji Sumimoto - Fujitsu
- Annu Dasari - Intel
- Nysal Jan K A - IBM
- Sylvain Jeaugey - NVIDIA
- Artem Polyakov - Mellanox
- Sameh Sharkawi - IBM
- Nathan Hjelm - LANL
Attending Remotely:
- Josh Hursey - IBM (Added a ☎️ icon next to the items I'd like to call in for, if possible)
Wednesday morning suggestions (requested by Perry and Rayaz):
- PMIx
- MTT
- ☎️ Ralph: PMIx
- Fault response APIs - definition, requests
- Information request APIs - definition, requests
- If/how we integrate the above into OMPI
- Memory and file descriptor leaks in the server
- Remove the barrier at the end of MPI_Init?
- Jeff: Now that high-speed networks can be accessed via multiple network stacks (and multiple Open MPI components), users are getting confused about how to (un)select specific networks. We need to figure out a better / easier way for users to convey what they want. See http://www.open-mpi.org/community/lists/devel/2015/10/18154.php for more detail.
- One (very loose and probably not yet well thought-out) idea is to have some kind of higher-level abstraction:
--[enable|disable] NETWORK_TYPE:QUALIFIER
- Related question for MPI_T: we have tool developers asking if we can report the following:
- Network interface(s) being used
- Some kind of descriptive name for each network interface being used
- Transport protocol being used on each interface
- Jeff+Ralph: Re-discuss the separation of our libraries into libmpi, libopen-rte, and libopen-pal.
- Beginning of the project: there was just libmpi. Later, it was split into projects, and then the project libraries. Later, the build was unified back into libmpi again.
- In Dec 2012 (here's the commit), we split the build back into 3 libraries. The commit message cites discussion at the Dec 2012 Open MPI dev meeting -- but unfortunately the wiki notes give no clue as to why this was done. Was it just because we developers like having 3 smaller libraries? Or is there some deeper technical issue? Neither Ralph nor Jeff remembers. 😦
- Rationale for bringing this up again: when upstream projects are trying to link against portions of our project, and then also support apps that link against all of it, we run into conflicts (e.g., the ORTE being used by the upstream project may be different than the one being used by the OMPI installation). Slurping it all up into one library -- and making only the MPI API be visible -- would resolve the problem. ...but we cannot recall if there are undesirable side-effects.
- ☎️ Geoff Paulsen: fast path for send when only 1 outstanding request (a la Platform MPI)
- ☎️ Ralph/George/Jeff/Edgar: establishing dependencies in frameworks. This is a somewhat larger topic, but for the time being we will focus on a simple usage scenario...
- In December, Ralph/George/Jeff talked about Intel's desire to use some of OPAL/ORTE/OMPI's frameworks in its own projects. Ralph had previously been copying source code around between repositories, and it was pretty much a mess.
- After much discussion, we realized that Ralph can just link projects like SCON against `libopen-rte.so` and just access the frameworks that he wants. I.e., do a full install of ORTE (or probably OMPI?) using `--with-devel-headers`, and then literally treat OMPI/ORTE/OPAL frameworks just like any other shared library. Yay!
- Sidenote: just for cleanliness, Ralph will "tighten up" each ORTE framework to require as few dependencies as possible
- There is one caveat, however: the application linking against `libopen-rte` likely doesn't want to call `orte_init()` to initialize all of ORTE; it just wants to use a few frameworks. So the application will need to know the "magic order" in which frameworks must be initialized (i.e., the contents and ordering of `orte_init()` and the ESS framework init [which is where most of the heavy lifting for ORTE initialization actually occurs these days]).
- What would be better is if the frameworks themselves could declare what their dependencies are. E.g., if the application initializes the ABC framework, the ABC framework should be able to realize that it requires the DEF framework to be initialized first. This could possibly be done with a registration-based system. E.g., in the first few lines of framework open and/or init, it can call something like `opal_register_framework_dependency("ABC", "DEF")`. (Components could even do the same thing; for example, the ob1 PML needs the BTL framework, so it could call `opal_register_component_dependency("pml", "ob1", "btl")` ... or something like this.)
- At this meeting, we'd like to discuss the possibilities for such a system, and sketch out what the API should be.
- Such a system could actually be used throughout all of OPAL, ORTE, OMPI, OSHMEM. It would not only eliminate the "magic ordering" that we have in `opal_init()`, `orte_init()`, `ompi_runtime_init()`, and `oshmem_runtime_init()`, but also allow for minimal initialization in cases where not all components are necessary for a particular run.
- MPI_Comm_Info ramifications...
- NOTE: This could also be useful for the "Additional tests to check for inadvertent dependencies between components / frameworks" bullet, much further down in this list.
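To make the registration idea above concrete, here is a minimal sketch of how declared dependencies could replace a hand-maintained "magic order". It is written in Python purely for illustration and is not Open MPI code: the `FrameworkRegistry` class and its methods are hypothetical, and only the framework names (`pml`, `btl`, `ABC`, `DEF`) come from the notes above. Opening a framework first opens everything it declared as a dependency, and a dependency cycle is reported instead of recursing forever.

```python
class FrameworkRegistry:
    """Toy registry (hypothetical): frameworks declare what they require."""

    def __init__(self):
        self._deps = {}      # framework name -> set of required frameworks
        self._opened = []    # order in which frameworks were actually opened

    def register_dependency(self, framework, requires):
        # Similar in spirit to the proposed
        # opal_register_framework_dependency("ABC", "DEF").
        self._deps.setdefault(framework, set()).add(requires)
        self._deps.setdefault(requires, set())

    def open(self, framework, _in_progress=None):
        """Open `framework` after recursively opening its dependencies."""
        in_progress = [] if _in_progress is None else _in_progress
        if framework in self._opened:
            return  # already initialized, nothing to do
        if framework in in_progress:
            cycle = " -> ".join(in_progress + [framework])
            raise ValueError(f"dependency cycle: {cycle}")
        in_progress.append(framework)
        for dep in sorted(self._deps.get(framework, ())):
            self.open(dep, in_progress)
        in_progress.pop()
        self._opened.append(framework)

    @property
    def opened(self):
        return list(self._opened)


registry = FrameworkRegistry()
registry.register_dependency("pml", "btl")  # the ob1 PML needs the BTL framework
registry.open("pml")
print(registry.opened)  # ['btl', 'pml']
```

With this shape, an application that only wants one framework calls `open()` on it and gets the minimal set of prerequisites, which is the "minimal initialization" property mentioned above.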
- Jeff: What should be our timeframe for forking for v3.0.0?
- Ralph+George: Routing framework
- What does this morph into as we move to OFI/BTLs under RML?
- What happens to RML resilience, which currently flows through the routed framework?
- Jeff+Ralph: TCP OOB component currently takes all IP addresses from all peers and tries to connect to them in order. If it fails to connect, it will time out and move on to the next IP address for the peer -- but it can stall the job for 30-60 seconds (during MPI_INIT) with no real output/feedback to the user, unless `oob_base_verbose` is high.
- Can we make this better?
- E.g., use usnic-btl-like network graph solving to figure out which local interface should be used to communicate with which remote interface?
- Or simultaneously open multiple non-blocking `connect(2)`ions and see which/if any succeed?
- Or ...?
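The "open several connections at once" option can be pictured with the sketch below. It is not the TCP OOB component's code: it uses Python's `asyncio` as a stand-in for non-blocking `connect(2)` calls driven by an event loop, and the function name `connect_first` and its `timeout` parameter are made up for illustration. All candidate addresses are tried concurrently, the first successful connection wins, and the remaining attempts are cancelled or closed, so one unreachable address no longer delays the others.

```python
import asyncio


async def connect_first(candidates, timeout=5.0):
    """Try every (host, port) in `candidates` concurrently.

    Returns ((host, port), reader, writer) for the first connection that
    succeeds. Raises ConnectionError if all attempts fail or time out.
    """
    async def attempt(addr):
        reader, writer = await asyncio.open_connection(*addr)
        return addr, reader, writer

    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    pending = {asyncio.ensure_future(attempt(addr)) for addr in candidates}
    winner = None
    try:
        while pending and winner is None:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break  # overall timeout expired
            done, pending = await asyncio.wait(
                pending, timeout=remaining,
                return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                if task.exception() is not None:
                    continue  # this address failed; keep waiting on others
                if winner is None:
                    winner = task.result()
                else:
                    task.result()[2].close()  # surplus connection, drop it
    finally:
        for task in pending:
            task.cancel()  # abandon attempts that are still in flight
    if winner is None:
        raise ConnectionError("no candidate address accepted a connection")
    return winner
```

A caller would run it as `asyncio.run(connect_first([("10.0.0.1", 5000), ("192.168.1.1", 5000)]))`, where the addresses are placeholders. The worst-case wait becomes one timeout in total rather than one timeout per unreachable address.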
- Renaming the component DLLs using the project-level name - i.e., change mca_ess_tm.la to orte_ess_tm.la. This would remove the current restriction against having the same framework name in two different projects, which is becoming more of an issue as ORTE and OPAL are reused.
- George+Nathan: Alternative mechanisms for tagging sentinel ompi_proc_t locations that preserve 32-bit support and do not depend on the size of the opal_process_name_t. See https://github.com/open-mpi/ompi/pull/1345
- This may have been solved already...?
- Jeff+Ralph: What to do about conflicting OPAL version numbers?
- Jeff has an old note about this -- something about conflicting version numbers between OPAL and ORCM...? I don't remember the exact context.
- The usNIC BTL currently uses the OPAL version number to determine what to do w.r.t. compatibility between the v1.10, master, and v2.x trees.
- ☎️ Fujitsu: Collaboration of Fujitsu and Open MPI Community
- Source Code Contribution
- Collaborative Development
- Contribution for Quality
- Discuss some additional tests
- Can we have some tests to check to see if there are components that are dependent upon other components (and should not be)?
- @edgargabriel has some issues in OMPIO that he'd like to do better: many of the OMPIO sub-frameworks (e.g., fcoll, fbtl, ...etc.) require the functionality of the ompio component in the io framework. How should he do this?
- 'nm' test for each framework?
- Singularity test(s)
- ☎️ Josh: Review MTT database schema
- There are evolving requirements on this schema - e.g., correlation to external data, addition of inventory, referencing the actual .ini file. Let's see if a more flexible schema can be devised that can accommodate the broader set of requirements
- ☎️ Thread multiple support - where are we on this?
- Async progress - status? What still needs to be done?
- Geoff: Licensing framework - compiles out by default. Provide a component and it will SLURP it into library. Hooks in `MPI_INIT`, `MPI_FINALIZE`, all of the local calls allowed before `MPI_INIT`.
- Jeff asks: do you need `MPI_T_INIT` and `MPI_T_FINALIZE`, too? What about `orte_init` and/or `opal_init[_*]`?
- Memory Pool/Registration Cache rewrite
- George: request overhaul?
- Artem: Shipping of UCX with OMPI sources.
- Can we double check that all vendor Open MPI distributions are appropriately marked? E.g., via `--with-ident-string`? Not sure how to do this other than to ask each vendor/distribution -- perhaps we should default to "Unknown distribution" and have the nightly/release scripts set the official strings...?
- IBM - working on it
- Cisco - verified has ident-string
- Fujitsu - modifies code to include Fujitsu-specific text
- Mellanox - working
- Bull - Sylvain will ping someone to check
- Intel - not sure if they distribute, but Ralph will check
- We will change the master repo to say "Unknown distribution" and modify the nightly/release scripts to set it to "Open MPI Community Distribution"
- Nathan: Do we really need `--enable-mpi-thread-multiple` any more?
- None of the BTLs can set threads on if we configure it off
- Becoming of greater interest due to increased cores/die
- Some performance hit on shared memory (~40 nanoseconds) due to checking if opal_using_threads()
- Perhaps enable by default, but retain the ability to disable for those not wanting to accept the hit?
- Nathan to create PR with the change, modify opal_using_threads to add OPAL_UNLIKELY to help reduce the penalty, will flag Brian to ask if/why this is a bad idea
- ☎️ Jeff: Git etiquette: should we start doing "git standard" first-lines in commit messages?
- There are some markdown techniques that make searches easier (git shortlog and git graph)
- Request that people follow the proper format:
- First line: area affected by the commit followed by a colon, followed by short one-line description
- Blank line
- Longer description
- Here's a good description of good git commit messages
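As an illustration of the requested format, the sketch below checks a commit message against the three rules listed above (area and colon on the first line, then a blank line, then the longer description). It is only a sketch: the function name `check_commit_message` is hypothetical, the example area `oob/tcp` is made up, and treating the longer description as optional is an assumption rather than something decided at the meeting.

```python
import re

# First line: "<area>: <short one-line description>"
_SUBJECT = re.compile(r"^(?P<area>[^\s:][^:]*): (?P<summary>\S.*)$")


def check_commit_message(message):
    """Return a list of format problems; an empty list means it looks fine."""
    problems = []
    lines = message.rstrip("\n").split("\n")
    if not _SUBJECT.match(lines[0]):
        problems.append("first line should be '<area>: <short description>'")
    # If there is anything after the first line, line two must be blank.
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line should be blank")
    return problems


good = "oob/tcp: try peer addresses in parallel\n\nLonger description here.\n"
print(check_commit_message(good))         # []
print(check_commit_message("Fix stuff"))  # one problem: no '<area>:' prefix
```

A check like this could run in a commit hook or a CI job, but where (or whether) to enforce it is left open here.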
- Ralph: Further cleanup of the code base for project separation
- Split autogen.pl and configure.ac by project?
- Cleanup naming conventions - still have "ompi" in the "opal" layer, "ompi" named configure variables in the opal layer, etc.
- Go ahead, gradual series of PR's to split the projects
- General request that people cleanup use of OMPI names in OPAL
- Discussion that came up on the user list about how to help users ensure that they build Open MPI the way that they think they have built it (e.g., did they really build TM support?): http://www.open-mpi.org/community/lists/devel/2016/01/18497.php
- How to have a solution that is "easy enough" for average users, but powerful enough to catch common cases (e.g., where feature X's headers/libs are in a non-default location, and user assumes that OMPI found them anyway/included support).
- Would be sooooo nice if we could do something other than Bourne shell programming! :frown: (i.e., for each framework / component to save itself in a data structure/hash, and then prettyprint that data structure at the end)
- What if we put only "critical" things in a summarized list at the end of configure? Problem is: how do we decide which is and isn't included?
- Nathan will add a .m4 macro to add your functionality to the summary to be printed at the end of configure. We will self-police to keep this from becoming overwhelming
- Issue filed: Modify ompi_info --all to include output of the framework/component table
- Maybe someday when someone has time: some kind of "make menu" that emits a platform file?
- Geoff Paulsen will generate an example platform file that contains all the options, manually edited to break it into sections (e.g., RTE's, networks)
- Jeff+Ralph: `--host` and `--hostfile` behavior (Ralph's favorite topic!).
- The PR https://github.com/open-mpi/ompi/pull/1353 seems to have gone too far
- Can we avoid a flip-flop between 1.10.x and 2.x?
- Geoff - Interest in --hostlist behavior from Platform-MPI? looks something like: --hostlist "mpi0[1-4,6]:2"
- Another much more complex example: --hostlist "mpi[01-10:2,13-11:4,16]:2,clustA[01-04]:4"
- If num procs is not specified, we will exit with error
- Add a new option for "fill-me-up", one per "slot"
- If the user cmd line requires topological detail, then we will look to the hostfile and resource manager (via PMIx) first. If we can't get it otherwise, then launch the DVM to get it. Otherwise, we will not launch the DVM as we don't need the topological detail.
- Geoff will create a new host parsing framework with these three new components. It will be multi-select, with each component looking at the orte_job_t to decide if an option was provided that they should parse. Output will be the list of node objects
- hostlist will require use of "sequential" mapper. Sequential mapper will not require -np to be provided.
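To show what expanding the simpler `--hostlist` form might involve, here is a rough sketch. The grammar is an assumption inferred only from the first example above (`prefix[ranges]:count`, with the count applying to every expanded host); it is not Platform MPI's or Open MPI's actual parser, and `expand_hostlist` is a hypothetical name. The more complex example, with per-range counts and descending ranges inside the brackets, is deliberately rejected rather than guessed at.

```python
import re

# One entry: a name prefix, optional [ranges], optional :count
_ENTRY = re.compile(
    r"(?P<prefix>[A-Za-z0-9_.-]+?)"
    r"(?:\[(?P<ranges>[0-9,-]+)\])?"
    r"(?::(?P<count>\d+))?"
)


def _split_top_level(spec):
    """Split on commas that are not inside [...]."""
    parts, current, depth = [], [], 0
    for ch in spec:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def expand_hostlist(spec):
    """Return [(hostname, count_or_None), ...] for the simple form only."""
    hosts = []
    for entry in _split_top_level(spec):
        match = _ENTRY.fullmatch(entry.strip())
        if match is None:
            raise ValueError(f"unsupported hostlist entry: {entry!r}")
        prefix = match.group("prefix")
        count = int(match.group("count")) if match.group("count") else None
        if match.group("ranges") is None:
            hosts.append((prefix, count))
            continue
        for item in match.group("ranges").split(","):
            low, _, high = item.partition("-")
            high = high or low
            if not low or int(high) < int(low):
                raise ValueError(f"unsupported range: {item!r}")
            width = len(low)  # keep zero padding as written, e.g. 01 -> 2 digits
            for n in range(int(low), int(high) + 1):
                hosts.append((f"{prefix}{n:0{width}d}", count))
    return hosts


print(expand_hostlist("mpi0[1-4,6]:2"))
# [('mpi01', 2), ('mpi02', 2), ('mpi03', 2), ('mpi04', 2), ('mpi06', 2)]
```

The count is returned as `None` when it is omitted, so the caller decides whether that is an error or a default.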
- Jeff: Do we want to create some templates for Github issues / pull requests?
- Per https://github.com/blog/2111-issue-and-pull-request-templates
- If nothing else, perhaps an issue template for ompi-release that says "DON'T FILE ISSUES AGAINST THIS REPO"
- Howard will create a template for ompi-release that tells the user not to file issues there
- Add bot to create issues to move PRs against master to identified branches when PR is merged into master, auto-assign issues to PR author
- Perhaps use labels in place of or in addition to issues? Implementer can decide!
- Ralph: Don't try to bind tasks when direct launched by an external launcher that did no binding?
- See https://github.com/open-mpi/ompi/pull/1386
- use-case of concern in cited ticket was OpenMP. We have hit this issue of binding on OpenMP applications before. With the schizo framework, and the ability to specify multiple personalities, should we add a personality option for OpenMP that turns off binding?
- if user explicitly states "no binding" to the resource manager on a direct launch cmd line, should we disable our auto-binding?
- Leave the case where nothing is specified on direct launch alone - we will still bind by default in that case. Users can use the MCA param to request "do-not-bind", and admins can set that in the default MCA param file if they want
- Look for envars to indicate "bind-to-none" was given to the RM on direct launch, and don't bind if so
- Leave rest alone
- Howard: let's start discussing the features we want for v2.1.0.
- OpenMP/Open MPI interop - might be able to help support this via schizo framework
- Howard is experimenting with this now - sounds like schizo should provide the necessary hooks
- ...add your own favorite features here...
- Singularity container support
- PLEASE ADD FEATURES HERE
- George: Per #1308, there's an ambiguity between the info key value that a user/application sets with MPI_COMM_INFO_SET and the value that is actually propagated to a child communicator via MPI_COMM_DUP_WITH_INFO: does it use the value that the user set, or the value that OMPI decided to use?
- Jeff proposed one possibility here
- This issue is coming up in front of the MPI Forum as a whole
- George / Dave: what are you going to propose to the Forum next week?
- General agreement between George, Dave, and Jeff about what shall be proposed next week (rest of us don't care!)
- Geoff: `mpirun --entry` method for wrapping multiple PMPI profiling libraries at runtime via fancy function pointer code.
- Geoff: `mpirun --aff` yet another way to specify bindings. From Platform MPI, very similar to existing method at backend (parsing the strings obtained from hwload), but exposes different syntax via `mpirun -aff ____`. Notes: https://gist.github.com/markalle/52fb42c701267c49b450
- Geoff: `mpirun --prot` - print pt2pt connection methods used for process to process communication.
- Ralph pointed out that some mapping of Platform-to-OMPI options could be accommodated via schizo framework. Will work with Geoff to see what makes sense and/or needs to be changed to make that feasible
- New options will be added by IBM via PR
- ☎️ Artem: Mellanox/LANL have raised a good point that customers do not want to have multiple different Open MPI installations in their environments (e.g., Vendor A OMPI and Vendor B OMPI and community OMPI).
- How can a customer have a single OMPI installation, but still have vendor/distribution-specific enhancements?
- Or is that the wrong question -- should we really be working to enable individual component distribution? This would disallow vendors from distributing patches to core.
- We don't take the patches in our tarball, but we did add the infrastructure so users can download patches and then use this option to apply them on-site
- Mellanox is satisfied with this approach, so nothing needs to be done.
- Jeff: Per https://github.com/open-mpi/ompi/issues/1379, it looks like MPI processes are "becoming invisible" to the resource manager (SLURM, in Cisco's case). Ralph and Jeff are pretty sure that at some point in the past, we set the orted to create its own process group and launch all MPI processes in that. This means that if/when the RM tries to kill the process group that it launches, it won't kill any of the MPI processes (because they opted out / created their own process group).
- Ralph thinks that this may have been done so that we could deliver a signal to the entire MPI process group and not signal the orted. He remembers that this was a Sun-asked-for feature.
- We might not want that behavior by default any more -- for the reasons cited on https://github.com/open-mpi/ompi/issues/1379 (i.e., that old stale processes can get left around and not killed by the resource manager).
- Let's check the code and decide if we want to revisit this decision of creating a process group by default.
- How is this affected when run under CGROUPs?
- Ralph will modify the orted to not put the application procs in a separate process group
- Please see https://www.open-mpi.org/community/lists/devel/2016/02/18612.php for explanation of the issue - this question needs to be revisited
- Jeff+George: discuss installation of monitoring test (https://github.com/open-mpi/ompi/pull/1390)
- Resolved: George to move the code to `ompi/contrib` and file a PR to move it to v2.1.0.