
opal_progress guidelines

Thananon Patinyasakdikul edited this page Jan 30, 2019 · 16 revisions

Per the discussion at the 2018 OMPI developer meeting, we agreed to move forward with making opal_progress() multithreaded. This guideline explains how to bring your component into compliance with that change.

What will change?

OMPI will no longer serialize calls to opal_progress(). This means your component's progress function may be invoked simultaneously from multiple threads. The change allows a communication component to be more efficient in multithreaded scenarios if it chooses to (i.e., by parallelizing the component with multiple working lanes).
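The "working lanes" idea can be sketched with plain POSIX threads. This is an illustration only, not how btl/uct or btl/ofi are actually implemented: lane_t, NUM_LANES, lanes_init(), lanes_progress(), and lanes_fini() are hypothetical names, and pthread_mutex_trylock() stands in for OPAL's own locking macros. Each lane owns its lock, so threads that enter progress at the same time can work on different lanes instead of contending for a single one.

```c
#include <pthread.h>

#define NUM_LANES 4   /* hypothetical lane count */

/* Hypothetical per-lane state: each lane owns its lock and its work. */
typedef struct lane_t {
    pthread_mutex_t lock;
    int events_handled;   /* stands in for real per-lane progress work */
} lane_t;

static lane_t lanes[NUM_LANES];

/* Initialize every lane. Returns 0 on success, -1 on failure. */
int lanes_init(void)
{
    for (int i = 0; i < NUM_LANES; i++) {
        if (pthread_mutex_init(&lanes[i].lock, NULL) != 0) {
            /* Undo the lanes that were already initialized. */
            while (--i >= 0) {
                pthread_mutex_destroy(&lanes[i].lock);
            }
            return -1;
        }
        lanes[i].events_handled = 0;
    }
    return 0;
}

/* Progress at most one lane. Starting from a caller-supplied hint,
 * take the first lane whose lock is free. This never blocks.
 * Returns the index of the lane progressed, or -1 if all were busy. */
int lanes_progress(unsigned int hint)
{
    for (unsigned int n = 0; n < NUM_LANES; n++) {
        unsigned int i = (hint + n) % NUM_LANES;
        if (pthread_mutex_trylock(&lanes[i].lock) == 0) {
            lanes[i].events_handled++;   /* placeholder for real work */
            pthread_mutex_unlock(&lanes[i].lock);
            return (int) i;
        }
    }
    return -1;
}

/* Destroy every lane lock. Call only after lanes_init() succeeded. */
void lanes_fini(void)
{
    for (int i = 0; i < NUM_LANES; i++) {
        pthread_mutex_destroy(&lanes[i].lock);
    }
}
```

A caller could pass a per-thread value as the hint so that different threads tend to start on different lanes. How lanes map onto real network resources is component-specific and is not covered by this sketch.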

At this stage, we still serialize opal_progress() to give components a grace period to adjust. We will announce via the mailing list the deadline after which this change takes effect.

Which components are affected?

Not all OMPI MCA components are affected by this change. Essentially, all communication components are affected, as is any component that has registered a progress function. If your component works with timed triggers, you should also be careful: the completion event might now be called from another thread, which creates opportunities for race conditions.

Do I have to modify my component?

This depends on how your component works. Let's break it down.

  • If your component is thread-safe:
    • If you want to improve your multithreaded performance: YES. Take a look at btl/uct or btl/ofi to see how to really take advantage of this change.
    • If you just want to get by:
      • If calling progress from multiple threads will not affect performance: NO.
      • If calling progress from multiple threads will affect performance: YES, go to "How?".
  • If your component is not thread-safe:
    • Maybe it is time to make it so?
    • If you want your component to be compliant: YES, go to "How?".
    • If not, your component will have a hard time running in multithreaded mode. You might not care, but someone will, and your component might create problems for others.

How?

This is the minimum required of a component.

  • Create a component mutex.
  • Wrap your original component_progress() in a TRYLOCK. DO NOT USE LOCK.
typedef struct yourcomponent_t {
    ...
    ...
    ...
    opal_mutex_t component_lock; /* Add a new mutex here and initialize it at init. */
} yourcomponent_t;

void yourcomponent_component_progress(void)
{
    /* TRYLOCK returns 0 when the lock was acquired; otherwise skip this pass. */
    if (!OPAL_THREAD_TRYLOCK(&component->component_lock)) {
        /* YOUR ORIGINAL PROGRESS ROUTINES HERE */
        /* YOUR ORIGINAL PROGRESS ROUTINES HERE */
        /* YOUR ORIGINAL PROGRESS ROUTINES HERE */

        OPAL_THREAD_UNLOCK(&component->component_lock);
    }
}

Don't forget to initialize and clean up the mutex!
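To make the full lifecycle concrete, here is a self-contained sketch of the same pattern using POSIX mutexes rather than the OPAL macros, so it can be compiled outside the Open MPI tree. The names mycomponent_t, mycomponent_init(), mycomponent_progress(), and mycomponent_fini() are hypothetical. Inside Open MPI you would use opal_mutex_t with OPAL_THREAD_TRYLOCK/OPAL_THREAD_UNLOCK as shown above, and the mutex is typically set up and torn down with OBJ_CONSTRUCT and OBJ_DESTRUCT; check your tree for the exact calls.

```c
#include <pthread.h>

/* Hypothetical component state; only the lock mirrors the guideline. */
typedef struct mycomponent_t {
    pthread_mutex_t component_lock;
    int progress_calls;   /* counts passes that actually did work */
} mycomponent_t;

static mycomponent_t mycomponent;

/* Initialize the lock once, at component init. Returns 0 on success. */
int mycomponent_init(void)
{
    mycomponent.progress_calls = 0;
    return pthread_mutex_init(&mycomponent.component_lock, NULL);
}

/* Non-blocking progress: returns 1 if this call did the work,
 * 0 if another caller already held the lock. */
int mycomponent_progress(void)
{
    if (pthread_mutex_trylock(&mycomponent.component_lock) != 0) {
        return 0;   /* lock is busy: return immediately, do not wait */
    }

    mycomponent.progress_calls++;   /* placeholder for the original routines */

    pthread_mutex_unlock(&mycomponent.component_lock);
    return 1;
}

/* Destroy the lock at component finalize. Returns 0 on success. */
int mycomponent_fini(void)
{
    return pthread_mutex_destroy(&mycomponent.component_lock);
}
```

When the lock is already held, mycomponent_progress() returns 0 without touching the protected state, which is the behavior the guideline asks for.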

What NOT to do?

Please refrain from blocking or waiting on the mutex in component_progress(), as this can stall the progress of every other component. Return immediately if you cannot acquire the mutex (use TRYLOCK, as the example suggests). If your component currently blocks, please fix it. Components that do not comply will get called out and will have to undergo trial by combat at the next meeting. Note that this is general advice: blocking in a progress function is bad practice regardless of this change.
