Notes on Heterogeneous Support in Open MPI

As of Open MPI v1.2, the code base supports operation in heterogeneous environments. The run-time and MPI layers deal with heterogeneous operations differently, as appropriate for their usage. The run-time heterogeneous support focuses on providing maximum portability in all cases, while the MPI layer seeks to provide the highest performance when peers share the same architecture.

Run-time support

The bulk of run-time support for heterogeneous operations is built into the DSS service. All buffers are packed in network byte order, allowing two processes to communicate without knowing anything about each other. With undescribed buffers, sized types (long, size_t, pid_t, etc.) are always packed in a fully-described format so that they can be unpacked on the remote process without the unpacking layer needing to know what the remote architecture looks like.
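
For illustration only, the idea behind fully-described packing looks something like the following sketch (this is not the actual DSS API; the function name and tag format here are made up):

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Hypothetical sketch, not the real DSS interface: pack an int32_t
   in network byte order with an explicit size tag, so the receiving
   process can unpack it without knowing anything about the sender's
   architecture. */
static size_t pack_described_int32(uint8_t *buf, int32_t value)
{
    uint32_t nbo = htonl((uint32_t) value);
    buf[0] = (uint8_t) sizeof(nbo);   /* describe the on-wire size */
    memcpy(buf + 1, &nbo, sizeof(nbo));
    return 1 + sizeof(nbo);
}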

The OOB requires a minimal amount of heterogeneous support, comparable to what would be expected of any TCP application that runs in heterogeneous environments.

MPI support

The MPI layer has the requirement of extremely high performance, which is often in opposition to the requirements of heterogeneous support. Therefore, heterogeneous support in Open MPI is a compile-time option, and there is a C preprocessor define, OMPI_ENABLE_HETEROGENEOUS_SUPPORT, that is 1 when heterogeneous support is requested and 0 otherwise. The current default is to support heterogeneous operations, but this may change in a future version of Open MPI. During initialization, an attempt is made to detect heterogeneous platforms and abort the job if heterogeneous support is not available.

Determining process architecture features

The datatype engine determines a number of characteristics about the architecture a given process is running on. This information is shared whenever ompi_proc_t information is shared (MPI_INIT and the dynamic process functions), and is available as a bitmask in any ompi_proc_t structure in the proc_arch field. proc_arch is an integer type of unspecified size (currently uint32_t, but that might change). Functions and bitmasks for decoding a given proc_arch are available in the header file ompi/datatype/dt_arch.h.

To determine whether two processes have the same architecture, simply compare their proc_arch fields. For example, to see if the process proc has the same architecture as the local process:

if (proc->proc_arch == ompi_proc_local()->proc_arch) {
    /* same architecture */
} else {
    /* different architecture */
}

Frequently, especially when preparing headers for use, the concern is not whether the architecture is the same but whether the processes are of the same endianness. Headers are always sent in network byte order (big endian), so work is only required if the local process is little endian and the remote process is big endian. A flag is generally set on headers indicating whether the header is in network byte order, because the current transport design does not supply an ompi_proc_t associated with the sender when a receive is completed. A typical usage, this time from the OB1 PML:

Sender

#if OMPI_ENABLE_HETEROGENEOUS_SUPPORT
#ifdef WORDS_BIGENDIAN
hdr->hdr_common.hdr_flags |= MCA_PML_OB1_HDR_FLAGS_NBO;
#else
/* if we are little endian and the remote side is big endian,
   we're responsible for making sure the data is in network byte
   order */
if (sendreq->req_send.req_base.req_proc->proc_arch & OMPI_ARCH_ISBIGENDIAN) {
    hdr->hdr_common.hdr_flags |= MCA_PML_OB1_HDR_FLAGS_NBO;
    MCA_PML_OB1_MATCH_HDR_HTON(hdr->hdr_match);
}
#endif
#endif

Receiver

#if defined(WORDS_BIGENDIAN) && OMPI_ENABLE_HETEROGENEOUS_SUPPORT
if (hdr->hdr_common.hdr_flags & MCA_PML_OB1_HDR_FLAGS_NBO) {
    MCA_PML_OB1_MATCH_HDR_NTOH(hdr->hdr_match);
}
#endif
mca_pml_ob1_recv_frag_match(btl, &hdr->hdr_match, segments, des->des_dst_cnt);

Note that we assume a big endian machine will never receive a little endian header. Little endian machines are responsible for ensuring that they never send a header to a big endian machine without first converting it to network byte order.

Datatype layer support

Coming soon...

Modex transfer

At the time a process publishes modex information, it is not known whether the application is running in a heterogeneous environment. Therefore, modex information is always published in network byte order. The modex information is published as blobs of binary data instead of using the OpenRTE datatype services, so it is up to the transport author to do the endian conversion. Currently, the code style is to provide NTOH and HTON macros for this conversion. For example, from the MX BTL:

struct mca_btl_mx_addr_t {
    uint64_t nic_id;
    uint32_t endpoint_id;
#if OMPI_ENABLE_HETEROGENEOUS_SUPPORT
    uint8_t padding[4];
#endif
};
typedef struct mca_btl_mx_addr_t mca_btl_mx_addr_t;

#define BTL_MX_ADDR_HTON(h)                     \
do {                                            \
    h.nic_id = hton64(h.nic_id);                \
    h.endpoint_id = htonl(h.endpoint_id);       \
} while (0)

#define BTL_MX_ADDR_NTOH(h)                     \
do {                                            \
    h.nic_id = ntoh64(h.nic_id);                \
    h.endpoint_id = ntohl(h.endpoint_id);       \
} while (0)

Note that the padding rules (see next section) have to be followed, as it is possible that an array of the modex structures will be transferred in the modex.
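
As a sketch of how such macros are typically used (modex_publish() below is a placeholder name, not the actual modex call), each address is converted before being published:

/* Hedged sketch: convert every address to network byte order before
   publishing.  hton64()/htonl() are no-ops on big endian hosts, so
   the conversion is safe to do unconditionally.  addrs and num_nics
   are assumed to have been filled in by the BTL's init code. */
for (i = 0; i < num_nics; ++i) {
    BTL_MX_ADDR_HTON(addrs[i]);
}
rc = modex_publish(addrs, num_nics * sizeof(mca_btl_mx_addr_t));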

Header sizing / padding rules

struct foo_t {
    uint32_t a;
    uint64_t b;
};

What is the value of sizeof(struct foo_t)? If you say 12, you'd be right sometimes. And if you say 16, you'd be right sometimes. Different compilers and architectures enforce different padding and alignment rules, which can cause problems for heterogeneous operations. For example, most 64 bit RISC systems will add 4 bytes of padding between a and b so that the 64 bit integer is 8 byte aligned (on PPC, the lack of alignment will slow the access down; on UltraSPARC, it may lead to a bus error). However, on x86 systems (including x86_64), 64 bit integers can be aligned on 4 bytes, so there may be no padding between a and b. If struct foo_t were sent between two such processes, the receiving process would find an incorrect value for b because it would be looking in the wrong place.
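
A quick way to see this for yourself (a self-contained check, not part of the Open MPI source):

#include <stdio.h>
#include <stdint.h>

struct foo_t {
    uint32_t a;
    uint64_t b;
};

int main(void)
{
    /* Prints 12 or 16, depending on the platform's alignment rules. */
    printf("sizeof(struct foo_t) = %zu\n", sizeof(struct foo_t));
    return 0;
}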

To overcome this issue, padding is added between fields in the structure so that the most conservative alignment rules are always met without any implicit padding. This padding is always protected by #if OMPI_ENABLE_HETEROGENEOUS_SUPPORT, because padding implies more data transfer for x86_64 machines, which implies slower data transfers for the most common platform used with Open MPI. The most conservative alignment rule is:

  • A structure entry is aligned on a byte boundary equal to its size. So a uint16_t should be 2 byte aligned, a uint32_t should be 4 byte aligned, and a uint64_t should be 8 byte aligned.
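
Applying this rule to the struct foo_t example above gives the following sketch (the padding field name is arbitrary):

struct foo_t {
    uint32_t a;
#if OMPI_ENABLE_HETEROGENEOUS_SUPPORT
    uint8_t  padding[4];   /* explicitly align b on an 8 byte boundary */
#endif
    uint64_t b;
};

With the explicit padding in place, sizeof(struct foo_t) is 16 on every platform, so both sides of a heterogeneous transfer find b at the same offset.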

This gets incredibly complex when structures are contained within structures. For example, from the OB1 PML, we see:

struct mca_pml_ob1_common_hdr_t {
    uint8_t hdr_type;  /**< type of envelope */
    uint8_t hdr_flags; /**< flags indicating how fragment should be processed */
};
typedef struct mca_pml_ob1_common_hdr_t mca_pml_ob1_common_hdr_t;

struct mca_pml_ob1_match_hdr_t {
    mca_pml_ob1_common_hdr_t hdr_common;   /**< common attributes */
    uint16_t hdr_ctx;                      /**< communicator index */
    int32_t  hdr_src;                      /**< source rank */
    int32_t  hdr_tag;                      /**< user tag */
    uint16_t hdr_seq;                      /**< message sequence number */
#if OMPI_ENABLE_HETEROGENEOUS_SUPPORT
    uint8_t  hdr_padding[2];               /**< explicitly pad to 16 bytes. */
#endif
};
typedef struct mca_pml_ob1_match_hdr_t mca_pml_ob1_match_hdr_t;

struct mca_pml_ob1_frag_hdr_t {
    mca_pml_ob1_common_hdr_t hdr_common;     /**< common attributes */
#if OMPI_ENABLE_HETEROGENEOUS_SUPPORT
    uint8_t hdr_padding[6];
#endif
    uint64_t hdr_frag_offset;                /**< offset into message */
    ompi_ptr_t hdr_src_req;                  /**< pointer to source request */
    ompi_ptr_t hdr_dst_req;                  /**< pointer to matched receive */
};
typedef struct mca_pml_ob1_frag_hdr_t mca_pml_ob1_frag_hdr_t;

Padding is needed at the end of mca_pml_ob1_match_hdr_t because most compilers will pad the structure to a size that is a multiple of 4 so that an array of the structures is properly aligned, but not all will do so when the structure is embedded in another structure; better safe than sorry, so be explicit. On the other hand, the padding in mca_pml_ob1_frag_hdr_t is in the middle of the structure and is necessary to properly align the hdr_frag_offset field. On most 64 bit RISC machines, the compiler would add 6 bytes of padding, but on 64 bit x86_64 machines, the compiler would likely add only 2 bytes of padding. This would clearly lead to badness on many systems, so we add the most conservative amount of padding.

Note: There is an alternative to adding explicit padding: influencing the structure packing rules of the compiler so that data is aligned using a given set of rules (for example, enforcing 4 byte alignment for 8 byte structure entries). However, the cost to RISC-like machines in accessing such structures is much greater than the cost of sending a few extra bytes across modern networks (6 is currently the highest number of padding bytes added for a given message). Further, the compiler hints for setting packing rules vary from one compiler to another, so setting packing rules would be a difficult maintenance problem.
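
For reference, the rejected alternative would look something like this (compiler-specific syntax, shown only to illustrate why it was not chosen):

#pragma pack(push, 4)     /* limit member alignment to 4 bytes
                             (GCC/MSVC-style pragma) */
struct foo_packed_t {
    uint32_t a;
    uint64_t b;           /* now 4 byte aligned: sizeof is 12 wherever
                             the pragma is honored, but RISC machines
                             pay a penalty on every access to b */
};
#pragma pack(pop)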

Mac OS X Note: The default GCC compiler on Intel Macs enforces padding rules that are identical to those on PowerPC machines running at the same word size. When compiling in 64 bit mode on an Intel Mac, the struct foo_t at the beginning of this section would have the same size as the structure on a 64 bit PowerPC build, but a different size than on a 64 bit Linux x86_64 box.
