MPI 5.0.x can't initialize on x86 host on heterogeneous system #12947

Open · dfherr opened this issue Dec 1, 2024 · 20 comments
dfherr commented Dec 1, 2024

I'm going to try my best to describe the issue. We tried to debug this internally with people working on pmix and didn't really get to a solution other than downgrading to openmpi 4.1.7 (which works as expected).

Background information

I am working on two x86 nodes running Rocky 9.1, each with an NVIDIA BlueField-2 DPU (one per node) running a recent NVIDIA-provided bfb image (Linux 5.4.0-1023-bluefield #26-Ubuntu SMP PREEMPT Wed Dec 1 23:59:51 UTC 2021 aarch64, to be precise).

The NIC/DPU is configured in InfiniBand mode and SSH connections between all 4 hosts are functional. Launching a simple MPI hello world works using Open MPI 4.1.7 (tried with separate installations, --without-ucx and --with-ucx pointing at UCX 1.17.0).

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

I tried 5.0.3, 5.0.5, and 5.0.6 from the official tarballs on the Open MPI download page. Each was compiled with --with-pmix=internal --with-hwloc=internal, once with UCX 1.17.0 and once without UCX (I'm working with UCX, so I wanted separate MPI installs to compare).

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

root installation:

$/usr/local/src: sudo tar -xvf openmpi
cd /usr/local/src/openmpi/
sudo ./configure --prefix=/opt/openmpi-5.0.6 --with-pmix=internal --with-hwloc=internal --without-ucx # or with --with-ucx=/opt/ucx-1.17.0
sudo make -j8 all
sudo make install

Please describe the system on which you are running

  • Operating system/version: Rocky 9.1 and Ubuntu 22.04
  • Computer hardware: x86 and ARM processors
  • Network type: Infiniband

Details of the problem

I'm trying to launch MPI processes between the DPU and the host. The process starts on both ranks, the remote (DPU) rank finishes initialization and prints debug output containing its rank, and the remote stdout is captured and arrives back at the mpirun host, but the x86 host never finishes initialization. Whether or not UCX is available makes no difference.

Host to host works. DPU to DPU works (when mpirun is started from the host, NOT when started from the DPU), and host to DPU hangs. All commands are run from the host:

# works host to host
/opt/openmpi-5.0.6/bin/mpirun --prefix /opt/openmpi-5.0.6 --host wolpy09-ib,wolpy10-ib -np 2 --mca btl_tcp_if_include 10.12.0.0/16 /var/mpi/dfherr/5.0.6/MPI_Helloworld
# works dpu to dpu (started from host)
/opt/openmpi-5.0.6/bin/mpirun --prefix /opt/openmpi-5.0.6 --host wolpy09-dpu-ib,wolpy10-dpu-ib -np 2 --mca btl_tcp_if_include 10.12.0.0/16 /var/mpi/dfherr/5.0.6/MPI_Helloworld
# hangs
/opt/openmpi-5.0.6/bin/mpirun --prefix /opt/openmpi-5.0.6 --host wolpy10-ib,wolpy10-dpu-ib -np 2 --mca btl_tcp_if_include 10.12.0.0/16 /var/mpi/dfherr/5.0.6/MPI_Helloworld

All of the above commands, including starting mpirun on the DPUs, work with Open MPI 4.1.7, both with and without UCX compiled.

With additional debug output, the hang always seems to occur after a dmdx key exchange has completed (see the comment below for up-to-date debug output). These are the debug flags I used:

--debug-daemons --leave-session-attached --mca odls_base_verbose 10 --mca state_base_verbose 10 
--prtemca pmix_server_verbose 10 --mca prte_data_server_verbose 100 --mca pmix_base_verbose 10 
--mca pmix_server_base_verbose 100 --mca ras_base_verbose 100 --mca plm_base_verbose 100

Happy to provide further debug output. For now I'm fine running Open MPI 4.1.7, but I felt I should report this issue with Open MPI 5.0.x regardless.

rhc54 (Contributor) commented Dec 1, 2024

Here is your problem:

[wolpy09:1073276] [prterun-wolpy10-1356013@0,1] DMODX REQ FOR prterun-wolpy10-1356013@1:0
[wolpy09:1073276] [prterun-wolpy10-1356013@0,1] DMODX REQ REFRESH FALSE REQUIRED KEY pml.ucx.5.0
[wolpy10:1356013] [prterun-wolpy10-1356013@0,0] dmdx:recv processing request from proc [prterun-wolpy10-1356013@0,1] for proc prterun-wolpy10-1356013@1:0
[wolpy10:1356013] [prterun-wolpy10-1356013@0,0] dmdx:recv checking for key pml.ucx.5.0
[wolpy09:1073276] prted/pmix/pmix_server_fence.c:268 MY REQ INDEX IS 0 FOR KEY pml.ucx.5.0
[wolpy10:1356013] [prterun-wolpy10-1356013@0,0] dmdx:recv key pml.ucx.5.0 not found - delaying

UCX is looking for a particular key that the other process is failing to provide - last time I saw this, it was because the OMPI version on one side was different from the version on the other side (i.e., the pml/ucx component has a different version, which means the key name is different since it has the version number in it). Afraid you'll need help from the OMPI UCX folks from here - has nothing to do with PMIx.

dfherr (Author) commented Dec 1, 2024

Thanks @rhc54 for looking over this.

The thing is, this occurs even when running without UCX compiled. It also looks like I provided some outdated debug output with a UCX configuration error in it that has since been fixed (in that old output, the hang occurs later, on the fallback to btl.tcp).

Here is non-UCX output from a run just now:

$ /opt/openmpi-5.0.6/bin/mpirun --prefix /opt/openmpi-5.0.6 --host wolpy09-ib,wolpy09-dpu-ib -np 2 --mca btl_tcp_if_include 10.12.0.0/16 --debug-daemons --leave-session-attached --mca odls_base_verbose 10 --mca state_base_verbose 10 --prtemca pmix_server_verbose 10 --mca prte_data_server_verbose 100 --mca pmix_base_verbose 10 --mca pmix_server_base_verbose 100 --mca ras_base_verbose 100 --mca plm_base_verbose 100 /var/mpi/dfherr/5.0.6/MPI_Hello
...
[wolpy09:2760807] [prterun-wolpy09-2760807@0,0] plm:base:receive done processing commands
[wolpy09:2760807] [prterun-wolpy09-2760807@0,0] plm:base:launch prterun-wolpy09-2760807@1 registered
[wolpy09-dpu:1707087] [prterun-wolpy09-2760807@0,2] FENCE UPCALLED ON NODE wolpy09-dpu
[wolpy09:2760815] [prterun-wolpy09-2760807@0,1] FENCE UPCALLED ON NODE wolpy09
[wolpy09-dpu:1707087] [prterun-wolpy09-2760807@0,2] DMODX REQ FOR prterun-wolpy09-2760807@1:0
[wolpy09-dpu:1707087] [prterun-wolpy09-2760807@0,2] DMODX REQ REFRESH FALSE REQUIRED KEY pml.base.2.0
[wolpy09-dpu:1707087] prted/pmix/pmix_server_fence.c:268 MY REQ INDEX IS 0 FOR KEY pml.base.2.0
[wolpy09:2760815] [prterun-wolpy09-2760807@0,1] dmdx:recv processing request from proc [prterun-wolpy09-2760807@0,2] for proc prterun-wolpy09-2760807@1:0
[wolpy09:2760815] [prterun-wolpy09-2760807@0,1] dmdx:recv checking for key pml.base.2.0
[wolpy09:2760815] [prterun-wolpy09-2760807@0,1] dmdx:recv key pml.base.2.0 found - retrieving payload
[wolpy09:2760815] [prterun-wolpy09-2760807@0,1] XMITTING DATA FOR PROC prterun-wolpy09-2760807@1:0
[wolpy09-dpu:1707087] [prterun-wolpy09-2760807@0,2] dmdx:recv response recvd from proc [prterun-wolpy09-2760807@0,1] with 106 bytes
[wolpy09-dpu:1707087] [prterun-wolpy09-2760807@0,2] FENCE UPCALLED ON NODE wolpy09-dpu
[wolpy09-dpu:1707087] [prterun-wolpy09-2760807@0,2] FENCE UPCALLED ON NODE wolpy09-dpu
[wolpy09:2760815] [prterun-wolpy09-2760807@0,1] FENCE UPCALLED ON NODE wolpy09
[wolpy09:2760815] [prterun-wolpy09-2760807@0,1] FENCE UPCALLED ON NODE wolpy09
[wolpy09-dpu:1707087] [prterun-wolpy09-2760807@0,2] FENCE UPCALLED ON NODE wolpy09-dpu
rank 1 on Linux wolpy09-dpu 5.4.0-1023-bluefield #26-Ubuntu SMP PREEMPT Wed Dec 1 23:59:51 UTC 2021 aarch64 says hello
[wolpy09:2760815] [prterun-wolpy09-2760807@0,1] DMODX REQ FOR prterun-wolpy09-2760807@1:1
[wolpy09:2760815] [prterun-wolpy09-2760807@0,1] DMODX REQ REFRESH FALSE REQUIRED KEY btl.tcp.5.0
[wolpy09:2760815] prted/pmix/pmix_server_fence.c:268 MY REQ INDEX IS 0 FOR KEY btl.tcp.5.0
[wolpy09-dpu:1707087] [prterun-wolpy09-2760807@0,2] dmdx:recv processing request from proc [prterun-wolpy09-2760807@0,1] for proc prterun-wolpy09-2760807@1:1
[wolpy09-dpu:1707087] [prterun-wolpy09-2760807@0,2] dmdx:recv checking for key btl.tcp.5.0
[wolpy09-dpu:1707087] [prterun-wolpy09-2760807@0,2] dmdx:recv key btl.tcp.5.0 found - retrieving payload
[wolpy09-dpu:1707087] [prterun-wolpy09-2760807@0,2] XMITTING DATA FOR PROC prterun-wolpy09-2760807@1:1
[wolpy09:2760815] [prterun-wolpy09-2760807@0,1] dmdx:recv response recvd from proc [prterun-wolpy09-2760807@0,2] with 84 bytes
hangs...

Notice the remote response in between.

And here is the output with UCX for comparison:

$ /opt/openmpi-5.0.6-ucx/bin/mpirun --prefix /opt/openmpi-5.0.6-ucx --host wolpy09-ib,wolpy09-dpu-ib -np 2 --mca btl_tcp_if_include 10.12.0.0/16 --debug-daemons --leave-session-attached --mca odls_base_verbose 10 --mca state_base_verbose 10 --prtemca pmix_server_verbose 10 --mca prte_data_server_verbose 100 --mca pmix_base_verbose 10 --mca pmix_server_base_verbose 100 --mca ras_base_verbose 100 --mca plm_base_verbose 100 /var/mpi/dfherr/5.0.6-ucx/MPI_Hello
...
[wolpy09:2764323] [prterun-wolpy09-2764323@0,0] plm:base:receive done processing commands
[wolpy09:2764323] [prterun-wolpy09-2764323@0,0] plm:base:launch prterun-wolpy09-2764323@1 registered
[wolpy09:2764331] [prterun-wolpy09-2764323@0,1] FENCE UPCALLED ON NODE wolpy09
[wolpy09-dpu:1708293] [prterun-wolpy09-2764323@0,2] FENCE UPCALLED ON NODE wolpy09-dpu
[wolpy09:2764331] [prterun-wolpy09-2764323@0,1] FENCE UPCALLED ON NODE wolpy09
[wolpy09-dpu:1708293] [prterun-wolpy09-2764323@0,2] FENCE UPCALLED ON NODE wolpy09-dpu
[wolpy09:2764331] [prterun-wolpy09-2764323@0,1] FENCE UPCALLED ON NODE wolpy09
[wolpy09-dpu:1708293] [prterun-wolpy09-2764323@0,2] FENCE UPCALLED ON NODE wolpy09-dpu
[wolpy09-dpu:1708293] [prterun-wolpy09-2764323@0,2] FENCE UPCALLED ON NODE wolpy09-dpu
[wolpy09:2764331] [prterun-wolpy09-2764323@0,1] FENCE UPCALLED ON NODE wolpy09
rank 1 on Linux wolpy09-dpu 5.4.0-1023-bluefield #26-Ubuntu SMP PREEMPT Wed Dec 1 23:59:51 UTC 2021 aarch64 says hello
[wolpy09:2764331] [prterun-wolpy09-2764323@0,1] DMODX REQ FOR prterun-wolpy09-2764323@1:1
[wolpy09:2764331] [prterun-wolpy09-2764323@0,1] DMODX REQ REFRESH FALSE REQUIRED KEY pml.ucx.5.0
[wolpy09:2764331] prted/pmix/pmix_server_fence.c:268 MY REQ INDEX IS 0 FOR KEY pml.ucx.5.0
[wolpy09-dpu:1708293] [prterun-wolpy09-2764323@0,2] dmdx:recv processing request from proc [prterun-wolpy09-2764323@0,1] for proc prterun-wolpy09-2764323@1:1
[wolpy09-dpu:1708293] [prterun-wolpy09-2764323@0,2] dmdx:recv checking for key pml.ucx.5.0
[wolpy09-dpu:1708293] [prterun-wolpy09-2764323@0,2] dmdx:recv key pml.ucx.5.0 found - retrieving payload
[wolpy09-dpu:1708293] [prterun-wolpy09-2764323@0,2] XMITTING DATA FOR PROC prterun-wolpy09-2764323@1:1
[wolpy09:2764331] [prterun-wolpy09-2764323@0,1] dmdx:recv response recvd from proc [prterun-wolpy09-2764323@0,2] with 382 bytes

As this occurs both with and without UCX, I assume it has nothing to do with UCX.

Here are the ompi_info and pmix_info outputs from the non-UCX build:

$ /opt/openmpi-5.0.6/bin/ompi_info                                                                                                                                                                            
                 Package: Open MPI root@wolpy09 Distribution
                Open MPI: 5.0.6
  Open MPI repo revision: v5.0.6
   Open MPI release date: Nov 15, 2024
                 MPI API: 3.1.0
            Ident string: 5.0.6
                  Prefix: /opt/openmpi-5.0.6
 Configured architecture: x86_64-pc-linux-gnu
           Configured by: root
           Configured on: Fri Nov 29 12:33:08 UTC 2024
          Configure host: wolpy09
  Configure command line: '--prefix=/opt/openmpi-5.0.6' '--with-pmix=internal' '--with-hwloc=internal' '--without-ucx'
                Built by: root
                Built on: Fri 29 Nov 12:37:23 UTC 2024
              Built host: wolpy09
              C bindings: yes
             Fort mpif.h: no
            Fort use mpi: no
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: no
 Fort mpi_f08 compliance: The mpi_f08 module was not built
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
  C compiler family name: GNU
      C compiler version: 11.4.1
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
           Fort compiler: none
       Fort compiler abs: none
         Fort ignore TKR: no
   Fort 08 assumed shape: no
      Fort optional args: no
          Fort INTERFACE: no
    Fort ISO_FORTRAN_ENV: no
       Fort STORAGE_SIZE: no
      Fort BIND(C) (all): no
      Fort ISO_C_BINDING: no
 Fort SUBROUTINE BIND(C): no
       Fort TYPE,BIND(C): no
 Fort T,BIND(C,name="a"): no
            Fort PRIVATE: no
           Fort ABSTRACT: no
       Fort ASYNCHRONOUS: no
          Fort PROCEDURE: no
         Fort USE...ONLY: no
           Fort C_FUNLOC: no
 Fort f08 using wrappers: no
         Fort MPI_SIZEOF: no
             C profiling: yes
   Fort mpif.h profiling: no
  Fort use mpi profiling: no
   Fort use mpi_f08 prof: no
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, Event lib: yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
          MPI extensions: affinity, cuda, ftmpi, rocm
 Fault Tolerance support: yes
          FT MPI support: yes
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
         MCA accelerator: null (MCA v2.1.0, API v1.0.0, Component v5.0.6)
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v5.0.6)
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v5.0.6)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v5.0.6)
                 MCA btl: self (MCA v2.1.0, API v3.3.0, Component v5.0.6)
                 MCA btl: sm (MCA v2.1.0, API v3.3.0, Component v5.0.6)
                 MCA btl: tcp (MCA v2.1.0, API v3.3.0, Component v5.0.6)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v5.0.6)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v5.0.6)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v5.0.6)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v5.0.6)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v5.0.6)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v5.0.6)
               MCA mpool: hugepage (MCA v2.1.0, API v3.1.0, Component v5.0.6)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component v5.0.6)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v5.0.6)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v5.0.6)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v5.0.6)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v5.0.6)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v5.0.6)
                MCA smsc: cma (MCA v2.1.0, API v1.0.0, Component v5.0.6)
                MCA smsc: xpmem (MCA v2.1.0, API v1.0.0, Component v5.0.6)
             MCA threads: pthreads (MCA v2.1.0, API v1.0.0, Component v5.0.6)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v5.0.6)
                 MCA bml: r2 (MCA v2.1.0, API v2.1.0, Component v5.0.6)
                MCA coll: adapt (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA coll: basic (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA coll: han (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA coll: inter (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA coll: libnbc (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA coll: self (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA coll: sync (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA coll: tuned (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA coll: ftagree (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA coll: hcoll (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA coll: monitoring (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA coll: sm (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v5.0.6)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v5.0.6)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component v5.0.6)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component v5.0.6)
               MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v5.0.6)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v5.0.6)
                MCA hook: comm_method (MCA v2.1.0, API v1.0.0, Component v5.0.6)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v5.0.6)
                  MCA io: romio341 (MCA v2.1.0, API v2.0.0, Component v5.0.6)
                  MCA op: avx (MCA v2.1.0, API v1.0.0, Component v5.0.6)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v5.0.6)
                 MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component v5.0.6)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v5.0.6)
                MCA part: persist (MCA v2.1.0, API v4.0.0, Component v5.0.6)
                 MCA pml: cm (MCA v2.1.0, API v2.1.0, Component v5.0.6)
                 MCA pml: monitoring (MCA v2.1.0, API v2.1.0, Component v5.0.6)
                 MCA pml: ob1 (MCA v2.1.0, API v2.1.0, Component v5.0.6)
                 MCA pml: v (MCA v2.1.0, API v2.1.0, Component v5.0.6)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component v5.0.6)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component v5.0.6)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v5.0.6)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v5.0.6)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component v5.0.6)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component v5.0.6)
$ /opt/openmpi-5.0.6/bin/pmix_info     
                 Package: PMIx root@wolpy09 Distribution
                    PMIX: 5.0.4
      PMIX repo revision: v5.0.4
       PMIX release date: Unreleased developer copy
           PMIX Standard: 4.2
       PMIX Standard ABI: Stable (0.0), Provisional (0.0)
                  Prefix: /opt/openmpi-5.0.6
 Configured architecture: pmix.arch
          Configure host: wolpy09
           Configured by: root
           Configured on: Fri Nov 29 12:33:27 UTC 2024
          Configure host: wolpy09
  Configure command line: '--disable-option-checking'
                          '--prefix=/opt/openmpi-5.0.6'
                          '--without-tests-examples' '--enable-pmix-binaries'
                          '--disable-pmix-backward-compatibility'
                          '--disable-visibility' '--disable-devel-check'
                          '--disable-hwloc-lib-checks'
                          '--with-hwloc-extra-libs=/usr/local/src/openmpi-5.0.6/3rd-party/hwloc-2.7.1/hwloc/libhwloc.la'
                          '--without-ucx'
                          'CPPFLAGS=-I/usr/local/src/openmpi-5.0.6/3rd-party/hwloc-2.7.1/include
                          -I/usr/local/src/openmpi-5.0.6/3rd-party/hwloc-2.7.1/include'
                          '--cache-file=/dev/null' '--srcdir=.'
                Built by: root
                Built on: Fri 29 Nov 12:36:01 UTC 2024
              Built host: wolpy09
              C compiler: gcc
     C compiler absolute: /bin/gcc
  C compiler family name: GNU
      C compiler version: "11" "." "4" "." "1"
  Internal debug support: no
              dl support: yes
     Symbol vis. support: no
          Manpages built: yes
              MCA bfrops: v12 (MCA v2.1.0, API v1.0.0, Component v5.0.4)
              MCA bfrops: v20 (MCA v2.1.0, API v1.0.0, Component v5.0.4)
              MCA bfrops: v21 (MCA v2.1.0, API v1.0.0, Component v5.0.4)
              MCA bfrops: v3 (MCA v2.1.0, API v1.0.0, Component v5.0.4)
              MCA bfrops: v4 (MCA v2.1.0, API v1.0.0, Component v5.0.4)
              MCA bfrops: v41 (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                 MCA gds: hash (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                 MCA gds: shmem2 (MCA v2.1.0, API v1.0.0, Component v5.0.4)
           MCA pcompress: zlib (MCA v2.1.0, API v2.0.0, Component v5.0.4)
                 MCA pdl: pdlopen (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                 MCA pif: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v5.0.4)
                 MCA pif: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v5.0.4)
        MCA pinstalldirs: env (MCA v2.1.0, API v1.0.0, Component v5.0.4)
        MCA pinstalldirs: config (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA plog: default (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA plog: stdfd (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA plog: syslog (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA pmdl: mpich (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA pmdl: ompi (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA pmdl: oshmem (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA pnet: opa (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA preg: compress (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA preg: native (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA preg: raw (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                 MCA prm: default (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                 MCA prm: pbs (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                 MCA prm: slurm (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA psec: native (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA psec: none (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA psec: munge (MCA v2.1.0, API v1.0.0, Component v5.0.4)
             MCA psensor: file (MCA v2.1.0, API v1.0.0, Component v5.0.4)
             MCA psensor: heartbeat (MCA v2.1.0, API v1.0.0, Component
                          v5.0.4)
             MCA psquash: flex128 (MCA v2.1.0, API v1.0.0, Component v5.0.4)
             MCA psquash: native (MCA v2.1.0, API v1.0.0, Component v5.0.4)
               MCA pstat: linux (MCA v2.1.0, API v1.0.0, Component v5.0.4)
               MCA pstrg: vfs (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                 MCA ptl: client (MCA v2.1.0, API v2.0.0, Component v5.0.4)
                 MCA ptl: server (MCA v2.1.0, API v2.0.0, Component v5.0.4)
                 MCA ptl: tool (MCA v2.1.0, API v2.0.0, Component v5.0.4)

And from the ARM DPU:

$ /opt/openmpi-5.0.6/bin/ompi_info 
                 Package: Open MPI root@wolpy09-dpu Distribution
                Open MPI: 5.0.6
  Open MPI repo revision: v5.0.6
   Open MPI release date: Nov 15, 2024
                 MPI API: 3.1.0
            Ident string: 5.0.6
                  Prefix: /opt/openmpi-5.0.6
 Configured architecture: aarch64-unknown-linux-gnu
           Configured by: root
           Configured on: Fri Nov 29 12:35:12 UTC 2024
          Configure host: wolpy09-dpu
  Configure command line: '--prefix=/opt/openmpi-5.0.6' '--with-pmix=internal' '--with-hwloc=internal' '--without-ucx'
                Built by: root
                Built on: Fri Nov 29 12:50:28 UTC 2024
              Built host: wolpy09-dpu
              C bindings: yes
             Fort mpif.h: yes (all)
            Fort use mpi: yes (full: ignore TKR)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to limitations in the gfortran compiler and/or Open MPI, does not support the following: array subsections, direct passthru (where possible) to underlying Open MPI's C functionality
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: gcc
     C compiler absolute: /bin/gcc
  C compiler family name: GNU
      C compiler version: 9.4.0
            C++ compiler: g++
   C++ compiler absolute: /bin/g++
           Fort compiler: gfortran
       Fort compiler abs: /bin/gfortran
         Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
   Fort 08 assumed shape: yes
      Fort optional args: yes
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: yes
      Fort BIND(C) (all): yes
      Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
       Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
            Fort PRIVATE: yes
           Fort ABSTRACT: yes
       Fort ASYNCHRONOUS: yes
          Fort PROCEDURE: yes
         Fort USE...ONLY: yes
           Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
         Fort MPI_SIZEOF: yes
             C profiling: yes
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: yes
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, Event lib: yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
          MPI extensions: affinity, cuda, ftmpi, rocm, shortfloat
 Fault Tolerance support: yes
          FT MPI support: yes
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
         MCA accelerator: null (MCA v2.1.0, API v1.0.0, Component v5.0.6)
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v5.0.6)
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v5.0.6)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v5.0.6)
                 MCA btl: self (MCA v2.1.0, API v3.3.0, Component v5.0.6)
                 MCA btl: sm (MCA v2.1.0, API v3.3.0, Component v5.0.6)
                 MCA btl: tcp (MCA v2.1.0, API v3.3.0, Component v5.0.6)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v5.0.6)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v5.0.6)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v5.0.6)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v5.0.6)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v5.0.6)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v5.0.6)
               MCA mpool: hugepage (MCA v2.1.0, API v3.1.0, Component v5.0.6)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component v5.0.6)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v5.0.6)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v5.0.6)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v5.0.6)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v5.0.6)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v5.0.6)
                MCA smsc: cma (MCA v2.1.0, API v1.0.0, Component v5.0.6)
             MCA threads: pthreads (MCA v2.1.0, API v1.0.0, Component v5.0.6)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v5.0.6)
                 MCA bml: r2 (MCA v2.1.0, API v2.1.0, Component v5.0.6)
                MCA coll: adapt (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA coll: basic (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA coll: han (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA coll: inter (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA coll: libnbc (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA coll: self (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA coll: sync (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA coll: tuned (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA coll: ftagree (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA coll: monitoring (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA coll: sm (MCA v2.1.0, API v2.4.0, Component v5.0.6)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v5.0.6)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v5.0.6)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component v5.0.6)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component v5.0.6)
               MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v5.0.6)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v5.0.6)
                MCA hook: comm_method (MCA v2.1.0, API v1.0.0, Component v5.0.6)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v5.0.6)
                  MCA io: romio341 (MCA v2.1.0, API v2.0.0, Component v5.0.6)
                  MCA op: aarch64 (MCA v2.1.0, API v1.0.0, Component v5.0.6)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v5.0.6)
                 MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component v5.0.6)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v5.0.6)
                MCA part: persist (MCA v2.1.0, API v4.0.0, Component v5.0.6)
                 MCA pml: cm (MCA v2.1.0, API v2.1.0, Component v5.0.6)
                 MCA pml: monitoring (MCA v2.1.0, API v2.1.0, Component v5.0.6)
                 MCA pml: ob1 (MCA v2.1.0, API v2.1.0, Component v5.0.6)
                 MCA pml: v (MCA v2.1.0, API v2.1.0, Component v5.0.6)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component v5.0.6)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component v5.0.6)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v5.0.6)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v5.0.6)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component v5.0.6)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component v5.0.6)
$ /opt/openmpi-5.0.6/bin/pmix_info 
                 Package: PMIx root@wolpy09-dpu Distribution
                    PMIX: 5.0.4
      PMIX repo revision: v5.0.4
       PMIX release date: Unreleased developer copy
           PMIX Standard: 4.2
       PMIX Standard ABI: Stable (0.0), Provisional (0.0)
                  Prefix: /opt/openmpi-5.0.6
 Configured architecture: pmix.arch
          Configure host: wolpy09-dpu
           Configured by: root
           Configured on: Fri Nov 29 12:37:02 UTC 2024
          Configure host: wolpy09-dpu
  Configure command line: '--disable-option-checking'
                          '--prefix=/opt/openmpi-5.0.6'
                          '--without-tests-examples' '--enable-pmix-binaries'
                          '--disable-pmix-backward-compatibility'
                          '--disable-visibility' '--disable-devel-check'
                          '--disable-hwloc-lib-checks'
                          '--with-hwloc-extra-libs=/usr/local/src/openmpi-5.0.6/3rd-party/hwloc-2.7.1/hwloc/libhwloc.la'
                          '--without-ucx'
                          'CPPFLAGS=-I/usr/local/src/openmpi-5.0.6/3rd-party/hwloc-2.7.1/include
                          -I/usr/local/src/openmpi-5.0.6/3rd-party/hwloc-2.7.1/include'
                          '--cache-file=/dev/null' '--srcdir=.'
                Built by: root
                Built on: Fri Nov 29 12:42:22 UTC 2024
              Built host: wolpy09-dpu
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
  C compiler family name: GNU
      C compiler version: "9" "." "4" "." "0"
  Internal debug support: no
              dl support: yes
     Symbol vis. support: no
          Manpages built: yes
              MCA bfrops: v12 (MCA v2.1.0, API v1.0.0, Component v5.0.4)
              MCA bfrops: v20 (MCA v2.1.0, API v1.0.0, Component v5.0.4)
              MCA bfrops: v21 (MCA v2.1.0, API v1.0.0, Component v5.0.4)
              MCA bfrops: v3 (MCA v2.1.0, API v1.0.0, Component v5.0.4)
              MCA bfrops: v4 (MCA v2.1.0, API v1.0.0, Component v5.0.4)
              MCA bfrops: v41 (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                 MCA gds: hash (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                 MCA gds: shmem2 (MCA v2.1.0, API v1.0.0, Component v5.0.4)
           MCA pcompress: zlib (MCA v2.1.0, API v2.0.0, Component v5.0.4)
                 MCA pdl: pdlopen (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                 MCA pif: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v5.0.4)
                 MCA pif: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v5.0.4)
        MCA pinstalldirs: env (MCA v2.1.0, API v1.0.0, Component v5.0.4)
        MCA pinstalldirs: config (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA plog: default (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA plog: stdfd (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA plog: syslog (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA pmdl: mpich (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA pmdl: ompi (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA pmdl: oshmem (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA pnet: opa (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA preg: compress (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA preg: native (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA preg: raw (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                 MCA prm: default (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                 MCA prm: slurm (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                 MCA prm: pbs (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA psec: native (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                MCA psec: none (MCA v2.1.0, API v1.0.0, Component v5.0.4)
             MCA psensor: file (MCA v2.1.0, API v1.0.0, Component v5.0.4)
             MCA psensor: heartbeat (MCA v2.1.0, API v1.0.0, Component
                          v5.0.4)
             MCA psquash: flex128 (MCA v2.1.0, API v1.0.0, Component v5.0.4)
             MCA psquash: native (MCA v2.1.0, API v1.0.0, Component v5.0.4)
               MCA pstat: test (MCA v2.1.0, API v1.0.0, Component v5.0.4)
               MCA pstrg: vfs (MCA v2.1.0, API v1.0.0, Component v5.0.4)
                 MCA ptl: client (MCA v2.1.0, API v2.0.0, Component v5.0.4)
                 MCA ptl: server (MCA v2.1.0, API v2.0.0, Component v5.0.4)
                 MCA ptl: tool (MCA v2.1.0, API v2.0.0, Component v5.0.4)

There are quite a few differences in ompi_info due to one of the builds automatically picking up Fortran support. The DPU build additionally has the shortfloat extension, and it is missing the xpmem, hcoll, and avx MCA components. Other than that, the gcc versions differ (the same combination works when using Open MPI 4.1.7).

The main difference I can see in the pmix_info output is pstat test on the DPU instead of pstat linux on the x86 host.

rhc54 (Contributor) commented Dec 1, 2024

I don't know why one would insist this has something to do with PMIx or with mpirun - all the output reads as perfectly fine. Procs are started, info is requested and returned to the MPI layer. At that point, PMIx and the runtime are done.

What is odd is that OMPI is doing its fence during MPI_Init, but then using dmodex to retrieve the values - which implies that OMPI didn't exchange info during the fence. There are some flags for doing that, but I don't see them set on your cmd line (could perhaps be in your environment instead). Still, the data is being returned - how you got it doesn't matter in the end.

Anyway, I think you have a problem with the MPI layer - someone else will have to address it. This doesn't appear to have anything to do with PMIx or the runtime.

dfherr (Author) commented Dec 1, 2024

I didn't want to insist it has anything to do with PMIx. I merely stated that I tried to debug it with colleagues who work on PMIx, because they are experienced with MPI initialization, but we couldn't find the issue.

The only thing I know for sure is that it's an issue with Open MPI 5.0.x (specifically tested 5.0.6, 5.0.5, 5.0.3) and that using the exact same installation and runtime steps works with Open MPI 4.1.7 (I literally executed the same commands in order from history, just switching out version numbers).

rhc54 (Contributor) commented Dec 1, 2024

Understood - my only point was that I see no indication of any problems in the non-MPI areas. However, I also don't see any indication of an error during initialization, so I'm not sure where that conclusion is coming from. All I can see from the info provided thus far is that your x86 host isn't doing something you expect.

For the MPI folks here to help, it might be useful if you could explain a bit more about what the x86 host is failing to do. It sounds like you are saying it was supposed to print something immediately after MPI_Init, and didn't? Not entirely clear from the above. If it is getting stuck in MPI_Init, you'll need significantly more debug output from the MPI layer to determine where precisely it is sticking.

dfherr (Author) commented Dec 1, 2024

Yes, to clarify:

int main(int argc, char** argv) {
    int rank;
    MPI_Init(&argc, &argv);
    printf("hello world");

"hello world" never prints on the host, while it does print on the DPU.

Happy to rerun my tests with additional/different debug flags and provide the output if anyone tells me what they need.

jsquyres (Member) commented Dec 2, 2024

You might want to put a \n in the hello world printf, and also possibly a fflush(stdout). Line-based output is not necessarily guaranteed to be flushed without that.

dfherr (Author) commented Dec 2, 2024

@jsquyres fair enough. I added it and reran the test; same result.

The full code looks like this, so it would have had a newline after a successful MPI_Comm_rank anyway.

int main(int argc, char** argv) {
    int rank;
    MPI_Init(&argc, &argv);
    printf("hello world\n");
    fflush(stdout);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char* system_info = NULL;
    if(get_system_info(&system_info) != 0){
        perror("Failed to get system information.");
        system_info = NULL;
    }
    printf("rank %i on %s says hello\n", rank, system_info);
    free(system_info);

    MPI_Finalize();
    return 0;
}

(get_system_info calls uname)
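
get_system_info itself isn't shown above; a minimal sketch of what it might look like, assuming it only wraps uname(2) and hands back a heap-allocated string (the actual helper may differ):

#include <stdio.h>
#include <stdlib.h>
#include <sys/utsname.h>

/* Hypothetical reconstruction: format the uname(2) fields into one string.
 * Returns 0 on success and fills *out with a malloc'd buffer the caller frees. */
static int get_system_info(char** out) {
    struct utsname u;
    if (uname(&u) != 0) {
        return -1;
    }
    int n = snprintf(NULL, 0, "%s %s %s %s %s",
                     u.sysname, u.nodename, u.release, u.version, u.machine);
    if (n < 0 || (*out = malloc((size_t)n + 1)) == NULL) {
        return -1;
    }
    snprintf(*out, (size_t)n + 1, "%s %s %s %s %s",
             u.sysname, u.nodename, u.release, u.version, u.machine);
    return 0;
}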

bosilca (Member) commented Dec 2, 2024

This is a heterogeneous setup. You need to compile OMPI with heterogeneous support. However, I know it will allow the OB1 to work properly (or at least it did not long ago), but I'm not sure the UCX PML supports that.

dfherr (Author) commented Dec 2, 2024

This is a heterogeneous setup. You need to compile OMPI with heterogeneous support. However, I know it will allow the OB1 to work properly (or at least it did not long ago), but I'm not sure the UCX PML supports that.

I'll recompile and see if that fixes it. Any idea why that's not necessary for 4.1.7 but would be necessary for 5.0.x?

dfherr (Author) commented Dec 2, 2024

This is a heterogeneous setup. You need to compile OMPI with heterogeneous support.

@bosilca could you clarify what you mean by that and what I should do?

There is --enable-heterogeneous, but that is about endianness, and both systems are little-endian.

Furthermore, the documentation https://docs.open-mpi.org/en/main/installing-open-mpi/configure-cli-options/misc.html says:

[screenshot of the --enable-heterogeneous documentation]

So enabling that might be actively harmful and may not be necessary to begin with. What should I compile with?

ggouaillardet (Contributor) commented

@dfherr Here is what you can do in order to troubleshoot this problem:

  1. Run a non-MPI program: mpirun ... hostname
  2. Run a PMIx program: pmixcc 3rd-party/openpmix/examples/client.c; mpirun ... a.out (a minimal client sketch follows below)
  3. Run an MPI program: mpicc examples/ring_c.c; mpirun ... a.out

Which one(s) hang when running heterogeneously (e.g. running MPI tasks on both the CPU and the DPU)?

If only the latter hangs, what if you force communications over TCP:
mpirun --mca pml ob1 --mca btl tcp,self ... a.out

If it still hangs, can you collect stack traces to figure out where the application is stuck?
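
For reference, a minimal PMIx client in the spirit of the bundled client.c example could look like the sketch below (an illustrative sketch, not the exact example shipped in 3rd-party/openpmix/examples):

#include <stdio.h>
#include <pmix.h>

int main(void) {
    pmix_proc_t myproc;
    pmix_status_t rc;

    /* connect to the local PMIx server (the prted/prterun daemon when launched by mpirun) */
    rc = PMIx_Init(&myproc, NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_Init failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }
    printf("PMIx client %s rank %u initialized\n", myproc.nspace, (unsigned)myproc.rank);

    rc = PMIx_Finalize(NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_Finalize failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }
    return 0;
}

If this initializes and finalizes cleanly on both the host and the DPU, the PMIx/runtime layer is most likely fine and the hang sits above it.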

dfherr (Author) commented Dec 4, 2024

@ggouaillardet

  1. This works as expected. mpirun itself isn't the problem; starting processes on both systems works.
  2. I'm not sure how to do that within my installation. Are you suggesting I build PMIx from source and try that?
  3. The host hangs in MPI_Init. The DPU continues, and further output from the DPU is captured correctly by the mpirun process even while the host process hangs. Forcing it to run via TCP doesn't change anything (which isn't surprising, considering it already used btl.tcp in the debug output I posted).

The stack trace of the hang looks like this on the host:

(gdb) thread apply all bt

Thread 3 (Thread 0x7f4b77fff640 (LWP 3053581) "async"):
#0  0x00007f4b7d90e21e in epoll_wait (epfd=23, events=0x7f4b77ffec70, maxevents=16, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f4b7c481cbd in ucs_event_set_wait () from /lib64/libucs.so.0
#2  0x00007f4b7c46e999 in ucs_async_thread_func () from /lib64/libucs.so.0
#3  0x00007f4b7d889c02 in start_thread (arg=<optimized out>) at pthread_create.c:443
#4  0x00007f4b7d90ec40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 2 (Thread 0x7f4b7d3ff640 (LWP 3053580) "MPI_Hello_world"):
#0  0x00007f4b7d90e21e in epoll_wait (epfd=8, events=0x11ecc40, maxevents=32, timeout=2100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f4b7dabc66c in epoll_dispatch.lto_priv () from /lib64/libevent_core-2.1.so.7
#2  0x00007f4b7dab74e1 in event_base_loop () from /lib64/libevent_core-2.1.so.7
#3  0x00007f4b7d4b7019 in progress_engine () from /opt/openmpi-5.0.6/lib/libpmix.so.2
#4  0x00007f4b7d889c02 in start_thread (arg=<optimized out>) at pthread_create.c:443
#5  0x00007f4b7d90ec40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 1 (Thread 0x7f4b7de2e780 (LWP 3053579) "MPI_Hello_world"):
#0  0x00007f4b7d90e21e in epoll_wait (epfd=3, events=0x11c0730, maxevents=32, timeout=0) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f4b7dabc66c in epoll_dispatch.lto_priv () from /lib64/libevent_core-2.1.so.7
#2  0x00007f4b7dab74e1 in event_base_loop () from /lib64/libevent_core-2.1.so.7
#3  0x00007f4b7d72d31f in opal_progress_events.isra () from /opt/openmpi-5.0.6/lib/libopen-pal.so.80
#4  0x00007f4b7d72d3d5 in opal_progress () from /opt/openmpi-5.0.6/lib/libopen-pal.so.80
#5  0x00007f4b7db780d1 in wait_completion () from /opt/mellanox/hcoll/lib/libhcoll.so.1
#6  0x00007f4b7dae95fe in comm_allreduce_hcolrte_generic () from /opt/mellanox/hcoll/lib/libhcoll.so.1
#7  0x00007f4b7dae9f30 in comm_allreduce_hcolrte () from /opt/mellanox/hcoll/lib/libhcoll.so.1
#8  0x00007f4b7c3417b0 in hmca_bcol_ucx_p2p_init_query () from /opt/mellanox/hcoll/lib/hcoll/hmca_bcol_ucx_p2p.so
#9  0x00007f4b7db821b3 in hmca_bcol_base_init () from /opt/mellanox/hcoll/lib/libhcoll.so.1
#10 0x00007f4b7db30a83 in hmca_coll_ml_init_query () from /opt/mellanox/hcoll/lib/libhcoll.so.1
#11 0x00007f4b7db79275 in hcoll_init_with_opts () from /opt/mellanox/hcoll/lib/libhcoll.so.1
#12 0x00007f4b7e11f944 in mca_coll_hcoll_comm_query () from /opt/openmpi-5.0.6/lib/libmpi.so.40
#13 0x00007f4b7e0e8e37 in mca_coll_base_comm_select () from /opt/openmpi-5.0.6/lib/libmpi.so.40
#14 0x00007f4b7e08aad0 in ompi_mpi_init () from /opt/openmpi-5.0.6/lib/libmpi.so.40
#15 0x00007f4b7e0b9f7e in PMPI_Init () from /opt/openmpi-5.0.6/lib/libmpi.so.40
#16 0x000000000040113b in main (argc=<optimized out>, argv=<optimized out>) at MPI_Hello_world.cpp:8

bosilca (Member) commented Dec 4, 2024

Please try without the hcoll collective component, aka. --mca coll ^hcoll
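
For illustration, applied to one of the earlier commands from this thread that would be something like /opt/openmpi-5.0.6/bin/mpirun --prefix /opt/openmpi-5.0.6 --host wolpy09-ib,wolpy09-dpu-ib -np 2 --mca btl_tcp_if_include 10.12.0.0/16 --mca coll ^hcoll /var/mpi/dfherr/5.0.6/MPI_Hello (hosts and paths are just the ones used earlier in this report).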

ggouaillardet (Contributor) commented

@dfherr Unless you built with an external PMIx library, the pmixcc wrappers are installed by Open MPI in the usual bin directory.

dfherr (Author) commented Dec 11, 2024

Please try without the hcoll collective component, aka. --mca coll ^hcoll

Yes, that fixes the issue. Should I report the issue to the hcoll developers somewhere?

bosilca (Member) commented Dec 11, 2024

HCOLL is part of HPC-X (the NVIDIA HPC solution). @janjust do you know how to report a bug related to HCOLL?

janjust (Contributor) commented Dec 11, 2024

Consider it reported, but we will not fix this specific setup (host/DPU), if it's an issue at all. The workaround is to simply disable HCOLL or build without it.

Having said that: this should work just fine, and you don't need to build with heterogeneous support. I have tested this specific setup in my own work and from what I recall it works fine. I know UCX has no issues mixing ARM/x86 libs.

dfherr (Author) commented Dec 11, 2024

Having said that: this should work just fine, and you don't need to build with heterogeneous support. I have tested this specific setup in my own work and from what I recall it works fine. I know UCX has no issues mixing ARM/x86 libs.

I mean, yeah, it should work, but it doesn't. Just to clarify: it's not specific to UCX; this happens with and without UCX compiled. I guess my MPI 4.1.7 works because it doesn't use hcoll?

janjust (Contributor) commented Dec 11, 2024

I guess my MPI 4.1.7 works because it doesn't use hcoll?

Yeah, looks that way.
