openmpi5.0.6 - Multiple nodes cannot run ( Error: Connection timed out (110)) #12968

Open
hermitguo opened this issue Dec 8, 2024 · 0 comments


hermitguo commented Dec 8, 2024

Please submit all the information below so that we can understand the working environment that is the context for your question.

Background information

CentOS 8.5 + Slurm 24.05.04 + hwloc-2.11.2 + libevent-2.1.12-stable + pmix-5.0.3 + ucx-1.17.0 + openmpi-5.0.6.

[user@master public]$ systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:firewalld(1)

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

[root@master ~]# ompi_info
Package: Open MPI root@master Distribution
Open MPI: 5.0.6
Open MPI repo revision: v5.0.6
Open MPI release date: Nov 15, 2024

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

#!/bin/bash
#SBATCH -p compute
#SBATCH --ntasks-per-node=64
#SBATCH --nodelist=master,node01

echo chkpt1
source /home/...
echo chkpt2
source /home/...
echo chkpt3
export PATH=/home/test/cp2k-2024.3/exe/local:$PATH
echo chkpt4

mpirun --mca btl_tcp_if_include ib0 --prefix /usr/local/lib/openmpi -np 128....

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: CentOS 8.5

  • Computer hardware:
    slurmd -C
    NodeName=node01 CPUs=152 Boards=1 SocketsPerBoard=2 CoresPerSocket=38 ThreadsPerCore=2 RealMemory=257271

  • Network type:
    ib0 (two NICs directly connected, no switch).

[root@node01 mpi]# ip a
ib0

sbatch test.sh


Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
A single-node job runs properly, but a job spanning two nodes in parallel fails. The slurm-100.out file contains the following error:

  1. sbatch test.sh
  2. slurm-100.out

WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.

Your Open MPI job may now hang or fail.

Local host: master
PID: 322039
Message: connect() to 172.16.0.193:1042 failed
Error: Connection timed out (110)
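
To help narrow this down, here is a minimal sketch of a TCP probe that could be run from master against the peer address in the message above. It is not part of Open MPI; the function name `can_connect` and the 5-second default timeout are assumptions made for illustration. It only separates "timed out" from "refused", which may hint at whether packets to that address are being dropped or routed over an unexpected interface.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> tuple[bool, str]:
    """Attempt one TCP connection and return (ok, detail).

    Hypothetical helper for diagnosis only; it does not use any Open MPI API.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        # No reply within the timeout: packets are likely dropped or misrouted.
        return False, "timed out"
    except ConnectionRefusedError:
        # The host answered but nothing listens on that port.
        return False, "refused"
    except OSError as exc:
        # Any other socket-level failure (no route, unreachable network, ...).
        return False, f"error: {exc}"
```

For example, `can_connect("172.16.0.193", 1042)` could be called from master. The port in the warning is chosen per job, so after the job has exited a "refused" result would still suggest the address is reachable, while "timed out" would match the failure reported here.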

Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block like below:

shell$ mpirun -n 2 ./hello_world