
Meeting 2016 02 Minutes

Geoff Paulsen edited this page Feb 24, 2016 · 53 revisions
  • Vendors/Distributions should ALL use --with-ident-string.

    • Went around the room and asked each vendor to do this if they are not.
    • Change the code in git to be "Unknown distribution", and then at release time change it to "Open MPI Community"
      • ACTION: Jeff Squyres
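A minimal sketch of how the ident string would differ between builds (the vendor string below is invented for illustration; --with-ident-string is the real configure option under discussion):

```shell
# Vendor/distribution build - each vendor sets its own string:
./configure --with-ident-string="Acme MPI 1.2.3 (based on Open MPI)" ...

# Community release tarballs would set:
./configure --with-ident-string="Open MPI Community" ...
```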
  • Do we really need --enable-mpi-thread-multiple any more?

    • Only ~40 nanoseconds on a 0-byte shared-memory ping-pong.
    • Not many thread-multiple tests - just a handful of old Sun thread-multiple tests.
    • MUCH discussion about this.
    • ISSUE - OPAL layer might want to take advantage of threading, but can't if the MPI layer doesn't support it.
      • Can it be separated? OPAL used libevent because it couldn't use OPAL threads.
    • ACTION: Nathan will create a Pull Request so we can investigate - this would make this config option useless.
    • How many checks on lock? Request, Comm, free list, ____, probably a dozen times?
  • Git etiquette: Please have the first line of git commit messages be: <area>: <= 50 characters describing the commit.
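A hypothetical example of the requested format (the area prefix and message text are made up; only the `<area>: <= 50 characters` convention comes from the discussion):

```shell
# Demo repo so the commit below is self-contained; paths are throwaway names.
mkdir -p demo_repo
git -C demo_repo init -q
echo 'fix' > demo_repo/file.c
git -C demo_repo add file.c
# First line: "<area>: <short description>", kept under 50 characters.
git -C demo_repo -c user.name=demo -c user.email=demo@example.com \
    commit -q -m 'btl/tcp: fix endpoint cleanup on error path'
git -C demo_repo log -1 --format=%s
# prints: btl/tcp: fix endpoint cleanup on error path
```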

  • ACTION: Howard - add a template of "DON'T submit issues here; instead use ompi/issues" to the ompi-release repo.

  • Discuss ways to help with creating pull requests for cherry-picking master changes to other release branches.

    • Either use labels like Howard does with the fork of libfabric for Cray, or have it auto-create an issue.
  • SLURM binding issue - srun myapp - if it sees there's no constraint on how it can be run, then if np <= 2 bind to core, np > 2 bind to socket.

    • If this app is a hybrid OpenMP + MPI app, this can mess them up.
    • If direct launching, SHOULD Open MPI be doing ANY binding?
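The behavior under discussion, sketched with illustrative commands (process counts and app name are placeholders):

```shell
# Direct launch, no binding constraint visible to Open MPI:
srun -n 2 ./myapp     # np <= 2: procs would be bound to core
srun -n 16 ./myapp    # np > 2:  procs would be bound to socket

# A hybrid OpenMP + MPI app likely wants no MPI-imposed binding:
srun --cpu_bind=none -n 16 ./myapp
```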
  • Ralph - Suggests splitting up OPAL and ORTE and OMPI configure.ac stuff.

    • There are still some places referring to ompi in the opal layer, which bites whenever someone touches these.
    • Do we want to split these up, so that this is cleaner? Or continue where it's all mushed?
    • Generally sounds good. Resource issue, but we could do this gradually.
  • Discussion about configure - users not configuring what they think they have.

    • ompi_info should print out what was configured (currently ompi_info -a doesn't print the blurb).
    • ACTION: fix ompi_info -a to print out components.
    • TWO OPTIONS:
      • Add a summary blurb to the end of configure, and have ompi_info print each framework and its list of components.
      • ACTION: Nathan will create a summary list, e.g.:
      • SLURM - YES (maybe param)
      • TCP BTL - YES ...
    • ACTION: Geoff - come up with a script to grep through the m4 files and generate a platform file of all possible options, hopefully ordering them and adding a bit of comments.
      • Version this platform file to work for a particular release.
    • The Linux kernel does this the right way: yes, no, auto, no-auto.
      • Like a Platform File. Have a tool that exports a complete Platform File for ALL options.
      • Can't use Linux system because it's all GPL, but perhaps other systems out there.
    • configure --help is built from the M4 system.
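A hypothetical sketch of the kind of script Geoff's action item describes: grep AC_ARG_ENABLE / AC_ARG_WITH declarations out of m4 files and emit a list of every option. The sample m4 content below is invented so the sketch is self-contained; a real run would walk the Open MPI source tree instead.

```shell
# Made-up sample tree standing in for the real configure.m4 files:
mkdir -p demo_tree
cat > demo_tree/sample_configure.m4 <<'EOF'
AC_ARG_ENABLE([mpi-thread-multiple],
  [AS_HELP_STRING([--enable-mpi-thread-multiple], [thread support])])
AC_ARG_WITH([slurm],
  [AS_HELP_STRING([--with-slurm], [SLURM support])])
EOF

# Pull out every AC_ARG_ENABLE/AC_ARG_WITH option name, one per line:
find demo_tree -name '*.m4' -print0 |
  xargs -0 grep -ho 'AC_ARG_\(ENABLE\|WITH\)(\[[^]]*\]' |
  sed -e 's/AC_ARG_ENABLE(\[/enable_/' \
      -e 's/AC_ARG_WITH(\[/with_/' \
      -e 's/\]$//' |
  sort -u > platform-all-options
cat platform-all-options
```

Ordering and comments (per the action item) could be layered on by also capturing each option's AS_HELP_STRING text.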
  • --host and --hostfile behavior (Ralph+Jeff)

    • PR 1353 seems to have gone too far.
    • Can we avoid a flip-flop between 1.10.x and 2.x?
    • NEW BEHAVIOR: --host when running under a resource manager and user didn't say -np X
      • No longer auto-compute number of procs, it's an ERROR.
      • Add an option "fill up" that says to run as many procs as needed to fill up all slots on each node.
      • one way or another, you must specify number of processes.
      • If you don't specify the number of processes using any of these flags, it is an error.
      • -novm - implies a homogeneous cluster, so mpirun can infer based on its own node.
      • SLURM has a regular expression to communicate what's on each node, so it can compute this.
      • Otherwise, we need to start up daemons on each node to detect and communicate back, just to compute this.
    • Some use cases (Fault Tolerant MPI) where users want to specify MORE hosts, but start by running on fewer hosts.
    • Check to see if the command line request REQUIRES deeper node topology info; if so, ask PMIx for the info, and if PMIx can't give us the info, launch DVM daemons to query it.
    • Jeff documented -host / --hostfile behavior in PR 1353.
    • OKAY with IBM just adding Platform MPI --hostlist syntax to specify both number of procs and hostfile. And document.
      • ACTION: Geoff will make this happen and file PR.
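A sketch of the proposed --host behavior (commands are illustrative; the "fill up all slots" flag had no agreed spelling yet, so it is deliberately not shown):

```shell
# Under a resource manager, --host with no process count becomes an error:
mpirun --host node1,node2 ./myapp          # ERROR - must specify -np

# Explicitly specifying the count works as before:
mpirun --host node1,node2 -np 4 ./myapp
```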
  • 1.10.3 -host -np 2 will error saying OVERSUBSCRIBED.

    • Went in, reviewed, went through the cycle.
    • We hate this, but no one is going to fix.
  • SLURM direct launch auto-binding.

    • Resolved to leave srun myapp binding.
    • Resolved to fix srun --cpu_bind=none (can detect, and not bind)
    • Already works when users specify srun binding, or mpi bindings.
  • MPI / OpenMP hybrid, especially Nested OMP (#parallel do loops).

    • Howard presented some info about KNL: 4 hyperthreads per core with MANY cores.
    • Easy way for an app writer to specify placement in OpenMP: OMP_PLACES = {0,1,2,3} <- spec is very wishy-washy.
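For reference, OMP_PLACES syntax per the OpenMP spec (CPU numbers are illustrative). Note that the form quoted above defines a single place containing four CPUs, which is different from four one-CPU places:

```shell
# One place holding hardware threads 0-3 (the form quoted above):
export OMP_PLACES="{0,1,2,3}"
# Four single-thread places - often what a nested-OpenMP app wants:
export OMP_PLACES="{0},{1},{2},{3}"
# Interval form: replicate place {0} four times with stride 1:
export OMP_PLACES="{0}:4:1"
```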
  • PR 1308 - MPI_Comm_info_set/get - This is a mess, and if we implement according to the standard, the GET is fairly useless. George and Dave will write up the issues and what Open MPI suggests the MPI Forum discuss and clarify. THEN "we" will implement THAT.

  • Features for 2.1.0 -

    • OpenMP/Open MPI interop -
    • Discussed the --entry feature from Platform-MPI for loading and running multiple PMPI_ profiling libraries.
      • Jeff thinks Giles changed Fortran to call PMPI_ for Open MPI 2.0 - because there are a small number of places where you have to know if you were called from C or Fortran.
      • Open MPI uses -runpath (LD_LIBRARY_PATH overwrites -rpath in some/most cases).
      • ACTION - Mark will create a RFC with his write up and some example code to discuss further.
    • Discussed --aff from Platform-MPI. Mostly just syntactic sugar on top of existing Open MPI commands.
      • Displayed some -aff and -prot output.
    • Discussed --prot. Platform-MPI prints in MPI_Init, but we don't know much until after the connections are demanded.
      • Had some old discussion about 3 levels of printing: modex(?), what we WANT to use, establish connections & print.
      • Nathan - could do some setting up of BTLs to ask "Is this peer reachable", with some new BTL functionality.
        • BTL send array in endpoint. OB1 rotates messages over endpoints.
      • Specifying which network interface is being used. Multi-rail would be REALLY useful to see which rails, and how many, are in use.
      • Nathan suggested adding this as downcall into PMLs, and have them report back up.
      • Jeff - Would like to see number of rails, and number of ports, and list of network interfaces, maybe addressing.
      • Suggestion: TCP2 - 2 TCP rails.
      • Suggestion: has to be optional, because of launch/teardown cost. George points out much of the info is in the Modex.
      • Suggestion: gather some info about what was actually used in Finalize.
      • Suggestion: Name, compare all names for BTLs - if there is an EXCEPTION print that out LOUDLY.
      • could print just * for each "node" in NxN graph.
  • Mellanox/LANL - customers don't want multiple different Open MPI installations in their environment.

    • Intent - Vendor would supply platform.path, and the customer would put it into a special directory. Got the mechanism, but didn't take the specific patches.
    • --with-platform=FILE - Mellanox can publish these patches, and customers can build with it.
    • Something shows up in ompi_info when they do --with-platform=FILE.
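Roughly the intended workflow (the file name and contents below are invented for illustration; --with-platform=FILE is the real mechanism):

```shell
# Vendor publishes a platform file, e.g. mellanox.platform, containing
# configure defaults such as:
#   with_verbs=yes
#   enable_debug=no
./configure --with-platform=path/to/mellanox.platform
ompi_info    # would show that this platform file was used
```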

--- Wednesday ----

  • PMIx
    • Two barriers to eliminate. One at the beginning of Finalize - can't get rid of that now.
    • One at end of MPI_Init.
    • BUT (Only for fabrics that can pre-compute endpoints).
    • Optional flag to say if want Barrier or not. Today PML selection is local decision.
  • PMIx Fault Response APIs
    • MPI has some error registration capability.
    • In PMIx, we offered the capability for the application to describe what the response should be.
    • Reg Err (callback fnc) - MPI has a way to register callbacks.
    • PMIx is adding an error response.
    • One option would be if the app gets the callback, then the app can call one of PMIx Error handling functions.
    • Or can come up with some MPI wrappers to eventually push into standard.
    • Question: Does it have to be an MPI function? - No, but for Open MPI it's coming to the application through MPI API.
  • Status update on new MTT database / Schema
  • UCX - Collaborative
    • Would be nice to have UCX source be included in OMPI sources.
    • Precedent for this: libfabric, PMIx, hwloc, others.
      • Do it as a framework.
      • Must support internal / external. Have to do some configury work.
      • Same License.
  • Question - for subprojects, at a given level of OMPI say 2.1, can the subcomponent rev with new features?
    • There is precedent: we brought in a new version of hwloc for some bug fixes in 1.10.x, but that brought in some more features.
  • Names of high speed networks are very Open MPI specific: vader, OB1, CM, etc.
    • In addition, there are multiple paths to a particular type of network through Open MPI.
    • Have talked about --[enable|disable] NETWORK_TYPE:QUALIFIER.
    • Now Tools want this same information.
    • What Platform-MPI does for Network types: MPI_IC_ORDER xxx,yyy,zzz,tcp (TCP is always last since it's slowest).
      • On command line can take lowercase (hint) or Uppercase (Demand) - -xxx, -XXX.
      • When doing "hybrid" approaches like RoCE or IPoIB, the command line looks like the protocol, plus some additional parameters to supply additional information when needed.
      • Separate -intra=nic option to specify not using shared memory for ranks on the same node.
    • Open MPI is similar, but inclusive rather than exclusive.
    • https://www.open-mpi.org/community/lists/devel/2015/10/18154.php
    • Have the dilemma of choice.
    • Do Providers have preferences of paths? Ex: MxM BTL, MxM MTL, UCX, openib
      • There is not a convenient way to disable multiple paths to the same thing.
    • Probably going to get into multiple conflicting use-cases.
    • Each component COULD register multiple short names, and disabling one short name would disable everything that registered that short name.
    • If you ask to include something that's not possible, it's an abort error.
    • If you exclude something that isn't possible continue running.
    • If user specifies "MxM" autodetection will determine if it uses PML or MTL based on priorities.
    • IDEA: Why not just assume shared memory and Self?
    • Direct Launch? Yes this will apply to both.
    • --net
    • Components can register multiple names.
    • ^ for disable?? can disable a "group" of protocols.
    • Tricky one is libfabric, which can do
    • QUALIFIER - could specify path through Open MPI.
    • Conflicts should abort.
    • How do we override if a system admin put a variable in the default params file?
      • No way to zero out an mca parameter set in a file.
      • Maybe we need to address THAT!
      • perhaps an 'unset' option.
    • Can add an explicit call for MCA base framework - components decide what grouping they are in.
    • Right now, for an mca parameter, you can't negate specific items in the list - only the entire list.
    • MCA base can resolve the multiple framework.
      • When components register themselves.
      • Call these "option-sets".
    • Do we need to specify an option that separates the runtime (ORTE) from the MPI layer?
    • Would be nice if this was syntactic sugar on top of what we have today, with also new names.
    • UUD component in OOB will register in Infiniband. User turns this off since they're thinking MPI.
      • Should this apply ONLY to MPI layer? What about OPAL layer?
      • RTE might need it.
      • As we move to point where runtimes / startup is using the fabric, we might want it to apply to both?
      • If you cut TCP - makes sense for MPI, but need it for IO, and other RTE stuff.
    • TCPoIB is complex too - it looks like TCP, but runs over InfiniBand.
    • Let's not let this paralyze us. There's still some stuff we can do.
    • Let's take the first step, and not get screwed up with TCPoIB.
    • Just apply to MPI one-sided, pt2pt, and coll traffic today!
    • In register, register short names for BTLs, MTLs, OSC, colls, and PMLs. The MCA base will filter; R2 will have to filter; OSC will filter. Give it a component structure and a filter type, and it will return a boolean saying whether you can use it.
    • Any reason to specify self or shared memory? (cross memory attach, etc).
    • Okay with always implying self, but why is shared memory different?
    • How can the users figure out what the options are? If we use registration system, can use ompi_info.
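For context, the existing MCA selection syntax this would be sugar on top of (the --mca include/exclude forms below are today's real syntax; --net is only a proposal):

```shell
# Include only these BTL components:
mpirun --mca btl self,vader,tcp ./myapp
# Exclude one component; "^" negates, and applies to the whole list:
mpirun --mca btl ^openib ./myapp
# Today there is no way to unset a parameter that a sysadmin put in the
# default params file, nor to negate a single item from a list.
```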