George Bosilca edited this page Jan 29, 2015 · 93 revisions

January 2015 OMPI Developer's Meeting

This is a standalone meeting; it is not being held in conjunction with an MPI Forum meeting.

Logistics

Doodle for choosing the date: https://doodle.com/zzaupgxge9y6medu

  • Date: 9am Tuesday, January 27 through 3pm Thursday, January 29, 2015
  • Location: Cisco Richardson facility (outside Dallas), building 4:

Cisco Building 4
2200 East President George Bush Highway
Richardson, Texas 75082-3550

Google maps link: https://goo.gl/maps/SNrbu

Attendees

Local attendees:

  • (*) Jeff Squyres - Cisco
  • (*) Howard Pritchard - Los Alamos
  • (*) Ralph Castain - Intel
  • (*) George Bosilca - U. Tennessee, Knoxville
  • (*) Dave Goodell - Cisco
  • (*) Edgar Gabriel - U. Houston
  • (*) Vish Venkatesan (not Tuesday) - Intel
  • (*) Geoff Paulsen - IBM
  • (*) Joshua Ladd - Mellanox Technologies
  • (*) Rayaz Jagani - IBM
  • (*) Dave Solt - IBM
  • (*) Perry Schmidt - IBM
  • (*) Naoyuki Shida - Fujitsu
  • (*) Shinji Sumimoto - Fujitsu
  • (*) Stan Graves - IBM
  • (*) Mark Allen - IBM
  • ...please add your name if you plan to attend...

(*) = Registered (by Jeff)

Remote attendees

  • Nathan Hjelm - Los Alamos
  • Ryan Grant - Sandia (planning to attend for the MTL and 1.9 branch discussions)

Topics still to discuss

Thurs morning

  • Ralph: ORCM update
    • Roadmap
    • Instant On launch planning
  • All: Progress on thread-multiple support
  • Ralph: Collective switching points & MPI tuning params - what is required to change them. Mellanox raised this in an earlier discussion that we never finished.
  • MTL issues (some of which might become moot...?):
    • Review note Jeff sent out yesterday about MTL idea
    • Intel/LANL: MTL selection issue (PSM vs. OFI)
    • Nathan: Enhance MTL interface to include one-sided and atomics

Deferred

  • Ralph: RTE-MPI sharing of BTLs

Since this will be a full meeting in itself, we'll have a good amount of time for discussion, design, and for hacking!

Resolved

  • Jeff/Howard: Branch for v1.9

    • See Releasev19 wiki page
    • We need to make a list of features for v1.9.0 to see if we're ready to branch yet
  • Jeff: libtool 2.4.4 bug / libltdl may no longer be embeddable. Should we embed manually, or should we just tell people to have libltdl-devel installed?

    • Resolved: let's stop embedding; we'll always link against external libltdl.
    • However: this means people need to have the libltdl headers installed (e.g., libltdl-devel RPM). We don't care about telling developers to do this, but we are a little worried about telling users to do this, because it raises the bar for building Open MPI: the assumption is that libltdl-devel is almost certainly not installed on most user machines.
    • The question becomes: what is configure's default behavior when it can't find ltdl.h?
      1. Abort
      2. Just fall back to --disable-dlopen behavior (i.e., slurp in plugins)
    • Let's bring up the "default behavior" issue as an RFC / beer discussion.
  • Jeff/Howard: Jenkins integration with Github:

    • how do we do multiple Jenkins servers? (e.g., running at different organizations)
    • much discussion in the room. Seems like a good idea to have multiple Jenkins polling github and running their own smoke tests. Need to figure out how to have them report results. Mike Dubman/Eugene V/Dave G will go investigate how to do this.
  • Howard/George: fate of coll ML

    • see http://www.open-mpi.org/community/lists/devel/2015/01/16820.php
    • who owns it?
    • should we try to fix it or disable by default?
    • Point was raised that coll/ml is very expensive during communicator creation -- including MPI_COMM_WORLD. Should we delete coll/ml? George asked Pasha; Pasha is checking.
    • Pasha: disable it for now, ORNL will fix and re-enable
    • DONE: George opal_ignore'd the coll/ml component

  • Ralph: Scalable startup, including:

    • Current state of opal_pmix integration
    • Async modex, static endpoint support
    • Re-define the role of PML/BTL add_procs: need to move to a more lazy-based setup of peers
    • Memory footprint reduction
    • Resolved:
    • Revive sparse groups
      • Edgar checked: passes smoke test today
      • first phase: replace ompi_proc_t array with pointer array to ompi_proc_t's
        • investigate further reduction in footprint
          • very simple, 1-way static setup of group hash, currently optimized for MCW
    • remove add_procs from MPI_Init unless preconnect called
      • PML calls add_procs with 1 proc on first send to peer
        • need centralized method to check if we need to make a proc (must be thread safe)
        • may need to poll BTLs...etc. Expensive! Async? Must also be done thread safe
        • still a blocking call
        • Nathan: if one-sided calls BTLs directly, then need to check/call add_procs
      • call add_procs with all procs for preconnect-all and in connect/accept, or if PML component indicates it needs to add_procs with all procs
      • need to check with MTL owners on impact to them
      • will only add_procs a peer proc at most once before it is del_proc'd
    • del_procs needs to release memory and NULL the proc entry to ensure that you get NULL when you next look for the proc
    • differentiate between "I need a proc for..."
      • communication
      • non-communication
    • need to check BTL/MTLs to see how they handle messages from peers that we don't have an ompi_proc_t for
      • need way for BTL/MTL to upcall the PML with the message so the PML can create a new ompi_proc_t, call add_proc, handle message
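The lazy peer setup described in this item could look roughly like the sketch below. It is only an illustration under stated assumptions: `demo_proc_t`, `demo_proc_get`, `demo_proc_peek`, `demo_proc_del` and `DEMO_WORLD_SIZE` are hypothetical names, not Open MPI symbols, and `demo_add_one_proc` merely stands in for a real PML/BTL `add_procs` call. Holding one mutex across the whole lookup is a simplification; the notes already point out that the real call is blocking and may be expensive.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

#define DEMO_WORLD_SIZE 8   /* hypothetical job size */

/* Hypothetical stand-in for ompi_proc_t. */
typedef struct {
    int  rank;
    bool added;            /* true once add_procs has seen this peer */
    int  add_procs_calls;  /* counter kept only for illustration */
} demo_proc_t;

/* Pointer array: a slot stays NULL until that peer is first needed. */
static demo_proc_t *demo_procs[DEMO_WORLD_SIZE];
static pthread_mutex_t demo_procs_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for a PML/BTL add_procs call covering exactly one peer. */
static void demo_add_one_proc(demo_proc_t *proc)
{
    proc->added = true;
    proc->add_procs_calls++;
}

/*
 * Centralized, thread-safe lookup. The proc is created on first use;
 * add_procs runs only when the caller needs the proc for communication,
 * and at most once until the proc is deleted.
 */
demo_proc_t *demo_proc_get(int rank, bool for_communication)
{
    assert(rank >= 0 && rank < DEMO_WORLD_SIZE);
    pthread_mutex_lock(&demo_procs_lock);
    demo_proc_t *proc = demo_procs[rank];
    if (NULL == proc) {
        proc = calloc(1, sizeof(*proc));
        if (NULL != proc) {
            proc->rank = rank;
            demo_procs[rank] = proc;
        }
    }
    if (NULL != proc && for_communication && !proc->added) {
        demo_add_one_proc(proc);
    }
    pthread_mutex_unlock(&demo_procs_lock);
    return proc;
}

/* Returns the current slot without creating anything (may be NULL). */
demo_proc_t *demo_proc_peek(int rank)
{
    assert(rank >= 0 && rank < DEMO_WORLD_SIZE);
    pthread_mutex_lock(&demo_procs_lock);
    demo_proc_t *proc = demo_procs[rank];
    pthread_mutex_unlock(&demo_procs_lock);
    return proc;
}

/* del_procs: release the memory and NULL the slot so the next lookup sees NULL. */
void demo_proc_del(int rank)
{
    assert(rank >= 0 && rank < DEMO_WORLD_SIZE);
    pthread_mutex_lock(&demo_procs_lock);
    free(demo_procs[rank]);
    demo_procs[rank] = NULL;
    pthread_mutex_unlock(&demo_procs_lock);
}
```

In this sketch a first send to rank 3 would call `demo_proc_get(3, true)`, while a caller that only needs the proc for bookkeeping would pass `false` and never trigger the add. Treating "non-communication" this way is one possible reading of the notes, not a decision recorded at the meeting.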
  • COMM_SPLIT_TYPE PR: https://github.com/open-mpi/ompi/pull/326 -- what about IP issues?

    • Jeff added a request to the PR that the author mark it as released under the BSD license so we can properly ingest it
    • George to contact offlist to discuss enhancements

  • Edgar: extracting libnbc core from the collective component into a standalone directory such that it can be used from OMPIO and other locations

    • move the libnbc core portions into a subdirectory in ompi
    • modification to libnbc will include new read/write primitives as well as new send/recv primitives with an additional indirection level for buffer pointers.
  • Ralph: Review: v1.8 series / RM experience with Github and Jenkins and the release process

    • Ralph's feedback: lots more PRs than we used to have CMRs
    • Ralph's feedback: people seem to be relying on Jenkins for correctness, when Jenkins is really just a smoke test
    • Github fans will look at creating some helpful scripts to support MTT testing of PRs
  • Ralph: PMIx update

    • Given orally at meeting
  • Ralph: Data passing down to OPAL

    • Revising process naming scheme
    • MPI_Info
      • OPAL_info (renamed) object and typedef it at the OMPI layer
        • Dave Solt from IBM volunteered
        • Perry is going to ensure that IBM's Schedule A is up-to-date
    • Error response propagation (e.g., BTL error propagation up from OPAL into ORTE and OMPI, particularly in the presence of async progress).
      • Create opal_errhandler registration, call that function with errcode and remote process involved (if applicable) when encountering error that cannot be propagated upward (e.g., async progress thread)
        • Ralph will move the orte_event_base + progress thread down to OPAL
        • Ralph will provide opal_errhandler registration and callback mechanism
        • Ralph will integrate the pmix progress thread to the OPAL one
        • opal_event_base priority reservations:
          • error handler (top)
          • next 4 levels for BTLs
          • lowest 3 levels for ORTE/RTE
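A registration-plus-callback mechanism of the kind described for opal_errhandler might be sketched as follows. Every name here (`demo_errhandler_register`, `demo_errhandler_report`, `demo_record_error`, `DEMO_NO_PEER`, the `DEMO_PRI_*` constants) is hypothetical, and the callback signature is an assumption. The priority enum also assumes exactly 8 levels numbered as in libevent, where 0 is the highest priority. Registration is not thread safe in this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed priority layout, 0 = highest priority (libevent convention). */
enum {
    DEMO_PRI_ERROR     = 0,  /* error handler (top) */
    DEMO_PRI_BTL_FIRST = 1,  /* next 4 levels for BTLs: 1..4 */
    DEMO_PRI_BTL_LAST  = 4,
    DEMO_PRI_RTE_FIRST = 5,  /* lowest 3 levels for ORTE/RTE: 5..7 */
    DEMO_PRI_RTE_LAST  = 7,
    DEMO_PRI_COUNT     = 8
};

#define DEMO_NO_PEER (-1)    /* used when no remote process is involved */

/* Hypothetical callback: error code plus the remote process, if applicable. */
typedef void (*demo_errhandler_fn_t)(int errcode, int remote_rank, void *cbdata);

static demo_errhandler_fn_t demo_errhandler_fn = NULL;
static void *demo_errhandler_cbdata = NULL;

/* An upper layer registers the function it wants called on such errors. */
void demo_errhandler_register(demo_errhandler_fn_t fn, void *cbdata)
{
    demo_errhandler_fn = fn;
    demo_errhandler_cbdata = cbdata;
}

/*
 * Called by code that cannot return an error upward, such as an async
 * progress thread. Returns 1 if a handler ran, 0 if none is registered.
 */
int demo_errhandler_report(int errcode, int remote_rank)
{
    if (NULL == demo_errhandler_fn) {
        return 0;
    }
    demo_errhandler_fn(errcode, remote_rank, demo_errhandler_cbdata);
    return 1;
}

/* Example handler that just records the most recent error. */
typedef struct {
    int errcode;
    int remote_rank;
    int count;
} demo_error_log_t;

void demo_record_error(int errcode, int remote_rank, void *cbdata)
{
    demo_error_log_t *log = cbdata;
    assert(NULL != log);
    log->errcode = errcode;
    log->remote_rank = remote_rank;
    log->count++;
}
```

An upper layer would call `demo_errhandler_register(demo_record_error, &log)` once at startup, and a transport would then call `demo_errhandler_report(errcode, peer)` from its progress thread.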
  • Howard: Progress on async progress

  • Nathan: --disable-smp-locks: remove this option?

    • See RFC email http://www.open-mpi.org/community/lists/devel/2015/01/16736.php
    • See, in particular, George's replies
    • In short: atomics are only used when multi-threading is enabled. But sm and vader need the smp locks.
    • However, people are discovering --disable-smp-locks, and using it breaks sm/vader.
    • OMPI atomic functions:
      • CAPS versions: only enabled when opal_using_threads() is true, which is only the case after set_opal_using_thread(true) has been called, which only happens when we are running with MPI_THREAD_MULTIPLE
      • lower_case version: only on when --enable-smp-locks
    • George misunderstood the issue. Now he understands and agrees with Nathan: remove the --enable-smp-locks option.
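The difference between the two flavors of atomic functions can be illustrated with C11 atomics. This is a sketch only: `demo_using_threads`, `demo_atomic_add_32`, `DEMO_THREAD_ADD32` and the `DEMO_ENABLE_SMP_LOCKS` macro are hypothetical stand-ins for the runtime flag, the lower_case and CAPS functions, and the configure option. They are not the actual OPAL definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the configure-time choice; 1 models --enable-smp-locks. */
#ifndef DEMO_ENABLE_SMP_LOCKS
#define DEMO_ENABLE_SMP_LOCKS 1
#endif

/* Stand-in for the runtime flag behind opal_using_threads(). */
bool demo_using_threads = false;

/* Plain read-modify-write: cheap, but not safe against concurrent writers. */
int32_t demo_plain_add_32(_Atomic int32_t *addr, int32_t delta)
{
    assert(NULL != addr);
    int32_t v = atomic_load_explicit(addr, memory_order_relaxed) + delta;
    atomic_store_explicit(addr, v, memory_order_relaxed);
    return v;
}

/*
 * lower_case flavor: a real atomic add only when SMP locks are compiled in.
 * Shared-memory transports rely on this being atomic even in single-threaded
 * runs, because other processes update the same memory.
 */
int32_t demo_atomic_add_32(_Atomic int32_t *addr, int32_t delta)
{
    assert(NULL != addr);
#if DEMO_ENABLE_SMP_LOCKS
    return atomic_fetch_add(addr, delta) + delta;
#else
    return demo_plain_add_32(addr, delta);
#endif
}

/* CAPS flavor: atomic only when the process is actually using threads. */
int32_t DEMO_THREAD_ADD32(_Atomic int32_t *addr, int32_t delta)
{
    return demo_using_threads ? demo_atomic_add_32(addr, delta)
                              : demo_plain_add_32(addr, delta);
}
```

Building this sketch with `-DDEMO_ENABLE_SMP_LOCKS=0` turns the lower_case function into a plain add, which models why disabling SMP locks breaks components that share memory across processes.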
  • Nathan: Performance of freelists and other common OPAL classes with OPAL_ENABLE_MULTI_THREADS==1 (as discussed in [GitHub]). Part of this is done already -- LIFO is a bit faster now (with threads), etc.

    • This is pretty much already resolved (after this item was added to the agenda) -- a fix went in on master for this, and a different fix went in for v1.8.
    • So the issue is now moot. Yay!
  • Vish: Memkind integration: see http://www.open-mpi.org/community/lists/devel/2014/11/16320.php

    • Vish has slides that he will post here.
    • We all generally agree that memkind introduces some new, desirable functionality
    • With some discussion in the room, it seems "easy" to add this functionality to MPI_ALLOC_MEM/MPI_FREE_MEM.
    • We decided that it's quite hard to know how to use this internally in the rest of the OMPI code base right now. We assume we will want to use it; we just don't know how yet (there are many variables). So let's get some experience with memkind in MPI_ALLOC_MEM first and revisit how to use this internally in the rest of the code base.
    • Here are the 4 steps we think we need to do:
      1. remove "allocator" framework use from ob1, replace it with malloc (because the use of allocator there seems to be pretty useless)
      2. create new allocator modules for things like:
        • posix_memalign
        • mmap
        • malloc
        • ...?
      3. change the mpool framework/modules to use allocator modules to get memory
      4. update MPI_Alloc_mem to:
        • lazily create allocator modules from memkind when each memkind type is first requested
        • make an mpool with that allocator
        • allocate memory from the mpool associated with that memkind allocator type
        • (somehow) register the memory with all other mpools (e.g., mpools in use by the BTLs)
        • MPI_FREE_MEM needs to unregister with all mpools (probably already done?)
        • MPI_FREE_MEM needs to return the memory to the right mpool
    • Nathan and Vish will coordinate to move forward on this.
    • George and Nathan are digging in to ensure that allocator is not already being used in a way that will be problematic. ob1 usage seems to be understood / ok to change. sm mpool needs to be investigated -- it uses allocator, too.
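Step 4 of the plan above (lazily created allocators, and freeing through the allocator that owns the memory) can be sketched without memkind itself. In the sketch below, `malloc` and `posix_memalign` stand in for two memory kinds, and every `demo_*` name, the 64-byte alignment and the fixed-size tracking table are assumptions made for illustration. Registering the memory with other mpools is left out.

```c
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef enum { DEMO_KIND_MALLOC = 0, DEMO_KIND_ALIGNED, DEMO_KIND_MAX } demo_kind_t;

/* Hypothetical allocator module: one per memory kind. */
typedef struct {
    void *(*alloc)(size_t size);
    size_t live;                 /* outstanding allocations */
} demo_allocator_t;

/* Ownership record so a free can find the allocator that made the memory. */
#define DEMO_MAX_TRACKED 64
typedef struct { void *ptr; demo_allocator_t *owner; } demo_record_t;

static demo_allocator_t *demo_allocators[DEMO_KIND_MAX];
static demo_record_t demo_records[DEMO_MAX_TRACKED];

static void *demo_alloc_malloc(size_t size) { return malloc(size); }

static void *demo_alloc_aligned(size_t size)
{
    void *p = NULL;
    return (0 == posix_memalign(&p, 64, size)) ? p : NULL;
}

/* Lazily create the allocator module the first time its kind is requested. */
static demo_allocator_t *demo_allocator_get(demo_kind_t kind)
{
    assert((int)kind >= 0 && kind < DEMO_KIND_MAX);
    if (NULL == demo_allocators[kind]) {
        demo_allocator_t *a = calloc(1, sizeof(*a));
        if (NULL == a) return NULL;
        a->alloc = (DEMO_KIND_ALIGNED == kind) ? demo_alloc_aligned : demo_alloc_malloc;
        demo_allocators[kind] = a;
    }
    return demo_allocators[kind];
}

int demo_allocator_created(demo_kind_t kind) { return NULL != demo_allocators[kind]; }

size_t demo_allocator_live(demo_kind_t kind)
{
    return (NULL != demo_allocators[kind]) ? demo_allocators[kind]->live : 0;
}

/* Stand-in for the allocation path: allocate from the requested kind and track it. */
void *demo_alloc_mem(size_t size, demo_kind_t kind)
{
    demo_allocator_t *a = demo_allocator_get(kind);
    if (NULL == a) return NULL;
    for (size_t i = 0; i < DEMO_MAX_TRACKED; i++) {
        if (NULL == demo_records[i].ptr) {
            void *p = a->alloc(size);
            if (NULL == p) return NULL;
            demo_records[i].ptr = p;
            demo_records[i].owner = a;
            a->live++;
            return p;
        }
    }
    return NULL;  /* tracking table full */
}

/* Stand-in for the free path: returns 0 on success, -1 for unknown pointers. */
int demo_free_mem(void *ptr)
{
    if (NULL == ptr) return -1;
    for (size_t i = 0; i < DEMO_MAX_TRACKED; i++) {
        if (demo_records[i].ptr == ptr) {
            demo_records[i].owner->live--;
            demo_records[i].ptr = NULL;
            demo_records[i].owner = NULL;
            free(ptr);  /* both demo allocators return free()-compatible memory */
            return 0;
        }
    }
    return -1;
}
```

A real implementation would need a per-allocator release function, since memory obtained from memkind has to be returned through memkind rather than through `free`.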
  • Fujitsu: future plans for Open MPI development

    • Shinji will post slides here.
  • Ralph/Nathan: MTL overhead reduction

    • ...more...
  • Jeff: MPI extensions (and not-yet-published MPI symbols): MPIX_ prefix, or OMPI_ prefix?

    • Just a discussion between Jeff and George.
    • Resolved to have a "rule of thumb" about naming symbols in mpi-ext:
      • If the symbol is never intended to be something outside of OMPI (e.g., OMPI_Paffinity_str), give it an "OMPI_" prefix.
      • If the symbol is intended to be standardized -- i.e., other MPI implementations may pick it up (e.g., ULFM functionality), give it an "MPIX_" prefix.
      • If the symbol has passed at least one vote at the MPI Forum (and subjectively passed it "easily"), i.e., the symbol looks like it's going to get into an official MPI standard but just hasn't done so yet, give it an "MPI_" prefix.
      • Jeff added ompi/mpiext/README.txt file with this rule of thumb.

Presentation Material
