
Meeting 2016 02 Minutes

  • Vendors/distributions should ALL use --with-ident-string (see the sketch below).

    • Went around the room and asked each vendor to do this if they are not.
    • Change the code in git to be "Unknown distribution", and then at release time change it to "Open MPI Community"
      • ACTION: Jeff Squyres
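
As background for the ident-string discussion, a minimal sketch of where such a string typically surfaces for users; the exact wording of the returned string depends on how the library was configured.

```c
/* Minimal sketch (not from the meeting): print the MPI library version
 * string, which is where a distribution's --with-ident-string text
 * typically becomes visible to users.  Assumes an MPI-3 library. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len = 0;

    MPI_Init(&argc, &argv);
    MPI_Get_library_version(version, &len);
    printf("MPI library: %s\n", version);
    MPI_Finalize();
    return 0;
}
```
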
  • Do we really need --enable-mpi-thread-multiple any more?

    • Only about a 40-nanosecond difference on a 0-byte shared-memory ping-pong.
    • Not many thread-multiple tests - only a handful of old Sun thread-multiple tests.
    • MUCH discussion about this.
    • ISSUE - the OPAL layer might want to take advantage of threading, but can't if the MPI layer doesn't support it.
      • Can it be separated? OPAL used libevent because it couldn't use OPAL threads.
    • ACTION: Nathan will create a pull request so we can investigate - this would make this configure option unnecessary.
    • How many checks on lock? Request, Comm, free list, ____, probably a dozen times?
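
For context on what --enable-mpi-thread-multiple gates, a minimal sketch of an application requesting MPI_THREAD_MULTIPLE and checking what the library actually granted; a build without thread-multiple support would report a lower provided level.

```c
/* Sketch: request full thread support and check what was granted. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        /* e.g. a build without thread-multiple support may only
         * report MPI_THREAD_SERIALIZED here. */
        printf("MPI_THREAD_MULTIPLE not available (provided=%d)\n", provided);
    }
    MPI_Finalize();
    return 0;
}
```
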
  • Git etiquette: please have the first line of Git commit messages be: <area>: <= 50 characters summarizing the commit.

  • ACTION: Howard - add an issue template saying "DON'T submit issues here; use ompi/issues instead" to the ompi-release repo.

  • Discuss ways to help with creating pull requests for cherry picking master changes to other release branches?

    • Either use labels, as Howard does with the libfabric fork for Cray, or have it auto-create an issue.
  • SLURM binding issue - with srun myapp, if Open MPI sees there's no constraint on how it can be run, then for np <= 2 it binds to core, and for np > 2 it binds to socket.

    • If the app is a hybrid OpenMP + MPI app, this can mess it up.
    • If direct launching, SHOULD Open MPI be doing ANY binding?
  • Ralph suggests splitting up the OPAL, ORTE, and OMPI configure.ac stuff.

    • Whenever someone touches these: there are still some places referring to ompi in the OPAL layer.
    • Do we want to split these up so that this is cleaner, or continue with everything mushed together?
    • Generally sounds good. Resource issue, but we could do this gradually.
  • Discussion about configure - users not getting the configuration they think they have.

    • ompi_info should print out what was configured (ompi_info -a doesn't print the blurb).
    • ACTION: fix ompi_info -a to print out components.
    • TWO OPTIONS:
      • Add a summary blurb to the end of configure, and have ompi_info print each framework and its list of components.
      • ACTION: Nathan will create a summary list of the form:
      • SLURM - YES (maybe a param)
      • TCP BTL - YES ...
    • ACTION: Geoff - come up with a script to grep through the m4 files and generate a platform file of all possible options, hopefully ordered and with some comments.
      • Version this platform file to work for a particular release.
    • The Linux kernel does this the right way: yes, no, auto, no-auto.
      • Like a Platform File. Have a tool that exports a complete Platform File for ALL options.
      • Can't use the Linux kernel's system because it's all GPL, but perhaps there are other systems out there.
    • configure --help is built from the M4 system.
  • --host and --hostfile behavior (Ralph+Jeff)

    • PR 1353 seems to have gone too far.
    • Can we avoid a flip-flop between 1.10.x and 2.x?
    • NEW BEHAVIOR: --host when running under a resource manager and user didn't say -np X
      • No longer auto-compute number of procs, it's an ERROR.
      • Add a "fill up" option that says to run as many procs as needed to fill all slots on each node.
      • one way or another, you must specify number of processes.
      • If you don't specify the number of processes using any of these flags, it is an error.
      • -novm implies a homogeneous cluster, so mpirun can infer based on its own node.
      • SLURM has a regular expression to communicate what's on each node, so we can compute it.
      • Otherwise, we need to start up daemons just to detect and communicate that info back.
    • Some use cases (fault-tolerant MPI) where users want to specify MORE hosts, but start by running on fewer hosts.
    • Check whether the command line request REQUIRES deeper node topology info; if so, ask PMIx for it; if PMIx can't give us the info, then we'll launch DVM daemons to query it.
    • Jeff documented -host / --hostfile behavior in PR 1353.
    • OKAY with IBM just adding the Platform MPI --hostlist syntax to specify both the number of procs and the hostfile, and documenting it.
      • ACTION: Geoff will make this happen and file PR.
  • 1.10.3 -host -np 2 will error saying OVERSUBSCRIBED.

    • Went in, reviewed, went through the cycle.
    • We hate this, but no one is going to fix it.
  • SLURM direct launch auto-binding.

    • Resolved to leave srun myapp binding as-is.
    • Resolved to fix srun --cpu_bind=none (we can detect that, and not bind).
    • Already works when users specify srun binding or MPI binding.
  • MPI / OpenMP hybrid, especially Nested OMP (#parallel do loops).

    • Howard presented some info about KNL: 4 hyperthreads per core, with MANY cores.
    • Easy way for app writers to specify placement in OpenMP: OMP_PLACES={0,1,2,3} <- the spec is very wishy-washy. (See the sketch below.)
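
A small hybrid sketch of the placement machinery being discussed: with OMP_PLACES set (e.g. OMP_PLACES="{0},{1},{2},{3}"), each thread can report the place it landed on. This assumes the OpenMP 4.5 place-query routines and is illustrative only.

```c
/* Sketch: MPI + OpenMP hybrid that reports where its threads landed.
 * Run with, e.g., OMP_PLACES="{0},{1},{2},{3}" and OMP_PROC_BIND=close. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* omp_get_place_num() (OpenMP 4.5) returns the place this
         * thread is bound to, or -1 if it is unbound. */
        printf("rank %d thread %d on place %d of %d\n",
               rank, omp_get_thread_num(),
               omp_get_place_num(), omp_get_num_places());
    }

    MPI_Finalize();
    return 0;
}
```
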
  • PR 1308 - MPI_Comm_set_info / MPI_Comm_get_info - This is a mess, and if we implement it according to the standard, the GET is fairly useless (see the sketch below). George and Dave will write up the issues and what Open MPI suggests the MPI Forum discuss and clarify. THEN "we" will implement THAT.
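
To make the "GET is fairly useless" point concrete, a minimal sketch using the MPI-3 calls: the standard only obliges the implementation to return hints it understood and chose to retain, so the round trip may legitimately come back empty. The "ompi_example_hint" key is hypothetical.

```c
/* Sketch: set an info hint on a communicator, then read back what the
 * library kept.  The "ompi_example_hint" key is hypothetical. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Info in, out;
    int nkeys;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&in);
    MPI_Info_set(in, "ompi_example_hint", "true");
    MPI_Comm_set_info(MPI_COMM_WORLD, in);
    MPI_Info_free(&in);

    /* Per the standard, only hints the implementation understood and
     * chose to keep must come back -- possibly none of them. */
    MPI_Comm_get_info(MPI_COMM_WORLD, &out);
    MPI_Info_get_nkeys(out, &nkeys);
    printf("hints retained: %d\n", nkeys);
    MPI_Info_free(&out);

    MPI_Finalize();
    return 0;
}
```
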

  • Features for 2.1.0 -

    • OpenMP/Open MPI interop -
    • Discussed the --entry feature from Platform-MPI for loading and running multiple PMPI_ profiling libraries (see the sketch after this sub-list).
      • Jeff thinks Gilles changed Fortran to call PMPI_ for Open MPI 2.0, because there are a small number of places where you have to know whether you were called from C or Fortran.
      • Open MPI uses -runpath (LD_LIBRARY_PATH overrides -rpath in some/most cases).
      • ACTION: Mark will create an RFC with his write-up and some example code to discuss further.
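
For reference, the conventional single-library PMPI interposition that an --entry-style feature would generalize to multiple profiling libraries: a wrapper MPI_Send does its bookkeeping and forwards to PMPI_Send. A minimal sketch, not Platform-MPI's or Open MPI's implementation.

```c
/* Sketch: a conventional PMPI interposition of MPI_Send.  A tool
 * library containing this symbol is linked (or LD_PRELOADed) ahead of
 * the MPI library; it forwards to the real implementation via PMPI_. */
#include <mpi.h>
#include <stdio.h>

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    printf("profiled MPI_Send: %d element(s) to rank %d\n", count, dest);
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}
```
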
    • Discussed --aff from Platform-MPI. Mostly just syntactic sugar on top of existing Open MPI commands.
      • Displayed some -aff and -prot output.
    • Discussed --prot. Platform-MPI prints it in MPI_Init, but we don't know much until after connections are demanded (see the sketch after this sub-list).
      • Had some old discussion about 3 levels of printing: modex(?), what we WANT to use, establish connections & print.
      • Nathan - could do some setting up of BTLs to ask "Is this peer reachable", with some new BTL functionality.
        • BTL send array in endpoint. OB1 rotates messages over endpoints.
      • Specifying which network interface is being used: for multi-rail it would be REALLY useful to see which interfaces, and how many, are in use.
      • Nathan suggested adding this as downcall into PMLs, and have them report back up.
      • Jeff - Would like to see number of rails, and number of ports, and list of network interfaces, maybe addressing.
      • Suggestion: TCP2 - 2 TCP rails.
      • Suggestion: it has to be optional, because of launch/teardown cost. George points out much of this info is in the modex.
      • Suggestion: gather some info in Finalize about what was actually used.
      • Suggestion: compare all BTL names; if there is an EXCEPTION, print that out LOUDLY.
      • Could print just a * for each "node" in the NxN graph.
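
A rough, purely illustrative sketch of the kind of NxN per-peer summary a --prot-style option could print: each rank reports a one-character transport tag per peer and rank 0 prints the table. The choose_transport() helper is hypothetical; a real version would query the PML/BTLs as discussed above.

```c
/* Sketch of a --prot-style NxN connectivity print (illustrative only).
 * choose_transport() is a hypothetical stand-in for asking the
 * PML/BTLs which path would be used to reach a given peer. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static char choose_transport(int myrank, int peer)
{
    /* Hypothetical: 'S' = shared memory if same "node", 'T' = TCP. */
    return (myrank / 4 == peer / 4) ? 'S' : 'T';
}

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *row = malloc(size);
    for (int peer = 0; peer < size; peer++) {
        row[peer] = (peer == rank) ? '*' : choose_transport(rank, peer);
    }

    char *table = (rank == 0) ? malloc((size_t)size * size) : NULL;
    MPI_Gather(row, size, MPI_CHAR, table, size, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (0 == rank) {
        for (int i = 0; i < size; i++) {
            printf("rank %3d: %.*s\n", i, size, &table[i * size]);
        }
        free(table);
    }
    free(row);
    MPI_Finalize();
    return 0;
}
```
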
  • Mellanox/LANL - customers don't want multiple different Open MPI installations in their environment.

    • Intent - Vendor would supply platform.path, and the customer would put it into a special directory. Got the mechanism, but didn't take the specific patches.
    • --with-platform=FILE - Mellanox can publish these patches, and customers can build with it.
    • Something shows up in ompi_info when they do --with-platform=FILE.

--- Wednesday ----

  • PMI-x
    • Two barriers to eliminate. One at the beginning of Finalize - can't get rid of that one now.
    • One at the end of MPI_Init.
    • BUT only for fabrics that can pre-compute endpoints.
    • Optional flag to say whether the barrier is wanted. Today PML selection is a local decision.
  • PMI-x Fault Response APIs
    • MPI has some error registration capability.
    • PMIx offers the capability for the application to describe what the response should be.
    • Register error (callback function) - MPI has a way to register a callback (see the sketch after this list).
    • PMIx is adding an error response capability.
    • One option: if the app gets the callback, the app can call one of the PMIx error-handling functions.
    • Or we can come up with some MPI wrappers to eventually push into the standard.
    • Question: Does it have to be an MPI function? - No, but for Open MPI it comes to the application through the MPI API.
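
As a reminder of the MPI-side registration the discussion builds on, a minimal sketch of attaching a communicator error handler; the PMIx registration discussed above would be analogous, but for runtime/fault events rather than MPI call errors.

```c
/* Sketch: MPI's existing error-callback registration on a communicator.
 * The handler here just prints the error class and returns. */
#include <mpi.h>
#include <stdio.h>

static void err_cb(MPI_Comm *comm, int *errcode, ...)
{
    int eclass = MPI_ERR_UNKNOWN;
    (void)comm;
    MPI_Error_class(*errcode, &eclass);
    fprintf(stderr, "MPI error reported, class %d\n", eclass);
}

int main(int argc, char **argv)
{
    MPI_Errhandler eh;

    MPI_Init(&argc, &argv);
    MPI_Comm_create_errhandler(err_cb, &eh);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, eh);

    /* Errors on MPI_COMM_WORLD now invoke err_cb instead of aborting. */

    MPI_Errhandler_free(&eh);
    MPI_Finalize();
    return 0;
}
```
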
  • Status update on new MTT database / Schema
  • UCX - Collaborative
    • Would be nice to have UCX source be included in OMPI sources.
    • Precedent for this: libfabric, PMIx, hwloc, others.
      • Do it as a framework.
      • Must support internal / external. Have to do some configury work.
      • Same License.
  • Question - for subprojects, at a given level of OMPI say 2.1, can the subcomponent rev with new features?
    • There is precedent: we brought in a new version of hwloc for some bug fixes in 1.10.x, but that also brought in some new features.
  • Names of high speed networks are very Open MPI specific: vader, OB1, CM, etc.
    • In addition, there are multiple paths to a particular type of network through Open MPI.
    • Have talked about --[enable|disable] NETWORK_TYPE:QUALIFIER.
    • Now Tools want this same information.
    • What Platform-MPI does for Network types: MPI_IC_ORDER xxx,yyy,zzz,tcp (TCP is always last since it's slowest).
      • On the command line it can take lowercase (hint) or uppercase (demand): -xxx vs. -XXX.
      • When doing "hybrid" approaches like RoCE or IPoIB, the command line looks like the protocol, plus some additional parameters to supply more information when needed.
      • Separate -intra=nic option to specify not using shared memory for ranks on the same node.
    • Open MPI is similar, but inclusive rather than exclusive.
    • https://www.open-mpi.org/community/lists/devel/2015/10/18154.php
    • We have the dilemma of choice.
    • Do Providers have preferences of paths? Ex: MxM BTL, MxM MTL, UCX, openib
      • There is no convenient way to disable multiple paths to the same thing.
    • Probably going to get into multiple conflicting use-cases.
    • Each component COULD register multiple short names, and disabling one short name would disable everything that registered that short name.
    • If you ask to include something that's not possible, it's an abort error.
    • If you exclude something that isn't possible, continue running.
    • If user specifies "MxM" autodetection will determine if it uses PML or MTL based on priorities.
    • IDEA: Why not just assume shared memory and Self?
    • Direct Launch? Yes this will apply to both.
    • --net
    • Components can register multiple names.
    • ^ for disable? Can disable a "group" of protocols.
    • The tricky one is libfabric, which can do ...
    • QUALIFIER - could specify path through Open MPI.
    • Conflicts should abort.
    • How do we override if a system admin put a variable in the default params file?
      • No way to zero out an MCA parameter set in a file.
      • Maybe we need to address THAT!
      • perhaps an 'unset' option.
    • Can add an explicit call for MCA base framework - components decide what grouping they are in.
    • Right now, for an MCA parameter, you can't negate specific items in the list, only the entire list.
    • The MCA base can resolve this across the multiple frameworks.
      • When components register themselves.
      • Call these "option-sets".
    • Do we need an option that separates the runtime (ORTE) from the MPI layer?
    • Would be nice if this was syntactic sugar on top of what we have today, with also new names.
    • The UD component in the OOB will register as InfiniBand. A user turns this off since they're thinking of MPI.
      • Should this apply ONLY to MPI layer? What about OPAL layer?
      • RTE might need it.
      • As we move to point where runtimes / startup is using the fabric, we might want it to apply to both?
      • If you cut TCP, that makes sense for MPI, but it's needed for I/O and other RTE stuff.
    • TCPoIB is complex too - it looks like TCP, but runs over InfiniBand.
    • Let's not let this paralyze us. There's still some stuff we can do.
    • Let's take the first step. Don't get hung up on TCPoIB.
    • Just apply it to MPI one-sided, pt2pt, and collective traffic today!
    • At registration, components register short names (BTLs, MTLs, OSC, COLL, PMLs); the MCA base will filter, R2 will have to filter, OSC will filter. Give it a component structure and a filter type, and it returns a boolean saying whether you can use it.
    • Any reason to specify self or shared memory? (cross memory attach, etc).
    • Okay with always implying self, but why is shared memory different?
    • How can users figure out what the options are? If we use the registration system, they can use ompi_info.
    • Sticky, because the head node doesn't necessarily have the same networks as the compute nodes.
    • libfabric: fi_info shows options and lists everything that's built in. UCX? We don't want to have to initialize the library.

--- Wednesday after Lunch --

Takahiro Kawashima (Fujitsu) presents. Fujitsu roadmap: the K computer, FX10, and FX100 are in operation now.

  • Flagship 2020 project is post K computer.

  • 2016-2017 - Fujitsu MPI is OMPI 1.8.4 based. (Skipping 1.10, but backporting many bugfixes)

  • 2018 sometime Fujitsu MPI will update to OMPI 2.0.x

  • Late 2019: Fujitsu MPI will move to OMPI 3.0.x for the post-K computer (mid 2020). The post-K computer means true use of several million processes; targeting < 1 GB memory usage per process.

  • Open MPI 2.0's add_procs improvements will be a big win here.

  • SPARC64 IXfx - two ccNUMA nodes.

  • TOFU2 - 6D TORUS/MESH; 4 RDMA engines per node (put, get, atomic); Global barrier with reduction operation.

  • tofu BTL, tofu LLP

  • tofu specific collectives mtofu COLL.

  • Tofu OSC. Progress thread on an assistant core (OPAL runtime) - developed for Tofu purposes (not the same as the Open MPI progress thread).

  • Tofu - process management (no orted).

  • FEFS (Fujitsu's filesystem) ad_fefs in ROMIO.

  • Statistical information for tuning, and others. Challenges:

  • Reducing memory consumption for exa-scale (10M processes).

  • Reducing MPI_Init/Finalize time for exa-scale

  • Collective communication algorithms tuned for the new interconnect.

  • Supporting the next MPI standard such as MPI-4.0?

  • many-core processing, power saving, etc. Collaboration:

  • Submitted PRs and patches for bug fixes.

  • Plan to merge some enhancements.

  • Develop new features and performance improvements

  • run MTT on our machines

  • Provide test programs.

  • Some source code contributions planned, targeting OMPI 2.1 or 3.0?

    • Statistical tuning.
  • Plan to reimplement the statistical information based on MPI_T performance variables and contribute it back (see the sketch below).

  • Feedback: PAPI has a system to expose software counters; PAPI will go through MPI_T.
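
A minimal sketch of the MPI_T performance-variable interface such statistics would be exposed through: initialize the tools interface and enumerate whatever pvars the installation provides. Reading a specific pvar additionally requires a session and a handle; this only lists what is there.

```c
/* Sketch: enumerate the MPI_T performance variables exposed by the
 * MPI library (names and counts vary by installation). */
#include <mpi.h>
#include <stdio.h>

int main(void)
{
    int provided, num;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_pvar_get_num(&num);
    printf("%d performance variables exposed\n", num);

    for (int i = 0; i < num; i++) {
        char name[256], desc[256];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, var_class, bind, readonly, continuous, atomic;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        MPI_T_pvar_get_info(i, name, &name_len, &verbosity, &var_class,
                            &dtype, &enumtype, desc, &desc_len, &bind,
                            &readonly, &continuous, &atomic);
        printf("  pvar %d: %s\n", i, name);
    }

    MPI_T_finalize();
    return 0;
}
```
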

  • Fujitsu MPI has a timeout-based deadlock-detection feature (~1 minute).

    • --mca mpi_deadlock_timeout
    • Not intelligent, but sufficient for most deadlock bugs; it times the requests that WAIT is called on.
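
For context on what a timeout-based detector catches, the textbook pattern below deadlocks because every rank blocks in a receive first; a mechanism like the one described above could flag the stuck requests after the timeout. Generic illustrative MPI code, not Fujitsu's implementation.

```c
/* Sketch: a textbook head-to-head deadlock that a timeout-based
 * detector would flag -- every rank blocks in MPI_Recv and nobody
 * ever reaches the matching MPI_Send. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, buf = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int peer = (rank + 1) % size;

    /* Both sides receive first: with blocking calls this never returns. */
    MPI_Recv(&buf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Send(&rank, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```
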
  • MPI function hooks (1/3): like PMPI, but at configure time instead of link time or runtime (see the sketch below).

    • Procedures are implemented as components of MFH framework.
    • Can wrap ALL functions at the MPI layer via CPP magic by defining OMPI_MFH_FOO_DEFINE() in ompi/mca/mfh/foo/mfh_foo_call.h.
    • Which functions to hook is configured by #if.
    • Can hook C, Fortran (mpif.h), or both; doesn't support the F08 bindings yet.
    • Originally designed for exclusion control between an application thread and a progress thread.
    • Fujitsu MPI is NOT thread safe, so needs exclusion control.
  • Showed the code for the CPP magic: a massive file that includes lots of info (args, name, etc.) for each MPI routine.
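
The actual macro contents were not captured in these notes, so the following is only a hypothetical sketch of the general preprocessor technique described above: a per-function DEFINE macro expands into a wrapper that runs pre/post hooks around the underlying call, and a configure-time #if decides which functions are hooked. All names here are made up.

```c
/* Hypothetical sketch of CPP-based hooking in the spirit of the MFH
 * framework description -- NOT the actual Fujitsu/Open MPI code.
 * A configure-time #if selects which functions get wrapped. */
#include <stdio.h>

#define HOOK_MPI_SEND 1   /* imagine this being set by configure */

/* The "real" function we are wrapping (stand-in for the MPI layer). */
static int real_send(const void *buf, int count, int dest)
{
    (void)buf; (void)count; (void)dest;
    return 0;
}

static void hook_pre(const char *fn)  { printf("enter %s\n", fn); }
static void hook_post(const char *fn) { printf("leave %s\n", fn); }

/* One DEFINE macro per routine: emit a wrapper that calls pre/post
 * hooks (e.g. for mutual exclusion with a progress thread). */
#define MFH_DEFINE(name, real, proto, args)              \
    int name proto                                       \
    {                                                    \
        hook_pre(#name);                                 \
        int ret = real args;                             \
        hook_post(#name);                                \
        return ret;                                      \
    }

#if HOOK_MPI_SEND
MFH_DEFINE(hooked_send, real_send,
           (const void *buf, int count, int dest),
           (buf, count, dest))
#endif

int main(void)
{
    char data[4] = "abc";
    return hooked_send(data, 4, 1);
}
```
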

  • Other use cases: statistical info; handling errors; logging function calls for debugging; checking MPI_THREAD_SERIALIZED usage.

  • Can install multiple hooks, defined at configure time.

  • Memory consumption and MPI_Init/Finalize time are big challenges for exa-scale.

  • Fujitsu will evaluate the add_procs improvement soon. It's great!

  • MPI-4 candidate features Fujitsu would like to see:

    • ULFM and endpoints - in Open MPI 2.1 or 3.0 and post K computer 2020.
    • ULFM: fault-tolerance.org - George is hoping to pull it back in after the PMIx integration.
    • Endpoints: we thought it would have been accepted into the MPI standard by now.
      • We don't see endpoints at the MPI Forum next week; Open MPI is watching. It seems plausible, but no one has tried it yet.
  • PMIx - considering using the PMIx API instead of the Fujitsu proprietary API.

    • May collaborate to design an exa-scale- and fault-tolerance-capable API.
  • Fujitsu interests: memory footprint; launch scaling; ULFM; power control/management; common predefined info keys; network topology information, etc.

  • Displayed a graph of parallelizing MPI_Pack: you can get good parallelization IF there are large blocks, many blocks, and extra cores to do the work (see the sketch below).
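
For reference, the MPI_Pack call whose parallelization was measured: a minimal sketch packing a strided (vector) datatype into a contiguous buffer, i.e. the many-block case where the graph showed parallelization paying off.

```c
/* Sketch: pack a strided vector datatype into a contiguous buffer --
 * the sort of many-block layout where parallel packing can help. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int blocks = 1024, blocklen = 4, stride = 8;
    MPI_Datatype vec;
    int packed_size, position = 0;

    MPI_Init(&argc, &argv);

    double *data = calloc((size_t)blocks * stride, sizeof(double));
    MPI_Type_vector(blocks, blocklen, stride, MPI_DOUBLE, &vec);
    MPI_Type_commit(&vec);

    MPI_Pack_size(1, vec, MPI_COMM_WORLD, &packed_size);
    char *packbuf = malloc(packed_size);

    /* Gather the strided blocks into one contiguous buffer. */
    MPI_Pack(data, 1, vec, packbuf, packed_size, &position, MPI_COMM_WORLD);
    printf("packed %d of %d bytes\n", position, packed_size);

    free(packbuf);
    free(data);
    MPI_Type_free(&vec);
    MPI_Finalize();
    return 0;
}
```
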

  • Fujitsu MPI - OMPIO or ROMIO?

    • FEFS is based on Lustre - which to use?
    • The OMPIO datatype engine is much better than ROMIO's. Today Edgar says use ROMIO.
  • Non-blocking collectives - libNBC? want to investigate.

    • No - just have it for conformance.
  • Planning to move to a new TOFU interface, and if/when they do that, then they can contribute BTL to community.

  • Fujitsu is not currently using hwloc, though hwloc WORKS on the K computer. Run MTT on a SPARC64 cluster - maybe weekly or nightly.

  • Still need to negotiate with the owner; cannot guarantee.

  • SPARC64-based CPUs - 32 cores (FX100) or 16 cores (FX10) per node.

  • ~ 100 nodes

  • Would Fujitsu be allowed to publish test results? - Yes they want to publish.

  • How can Fujitsu publish MTT results if not directly connected to the Internet? Some relay mechanism is needed.

  • Currently only support MPI_THREAD_SERIALIZED with progress thread. Not MPI_THREAD_MULTIPLE.

  • Edgar reported some OMPIO vs. ROMIO performance numbers in various cases.

    • Bottom line: OMPIO performance is on par, or better in many cases, due to a more efficient datatype representation and progress-engine hooks (only for BTLs???).
  • Component rename: from mca_framework_component to project_framework_component in the package libdir. Shooting for Open MPI 2.1.

    • There are a couple of ISVs that do this.
  • Multiple Libraries issue:

    • Apps link against 3 libraries: orte, orcm, mpi.
    • ACTION: agreed to slurp 3 libraries back to 1 library, and hopefully only expose MPI symbols in libmpi.
  • Static Framework for licensing.

    • name it more generically: ompi/mca_hooks - mpi init_fini hooks.
    • opal_init - both orted and ranks; no communicators available. Can use PMIx to do communication.
      • opal_init_util - don't have to get key here.
    • Framework should live in opal, but can open the framework wherever.
    • orte_init - has communication available.
    • Use a string instead of an enum to plug into the framework, so it can get multiple hooks.
      • Many-of-many - allow multiple callbacks.
    • opal_show_help something nice and abort - if you want to abort and don't want mpirun to show another error.
    • Don't worry about threading.