openMPI dropped inbound connection #12918

Open
KansaiTraining opened this issue Nov 12, 2024 · 1 comment

@KansaiTraining

I have found a couple of issues that look similar to this one, but I can't tell whether they were resolved or how they apply to my situation.

I am running Open MPI jobs under Slurm with srun. A job on a single node completes (with some warnings), but when I run it on two nodes I get:

[5A301-0407-G5500-12:89116] btl: tcp: attempting to connect() to [[62864,0],0] address 10.3.29.82 on port 1031
--------------------------------------------------------------------------
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected.  This is highly unusual.

The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).

  Local host:          5A301-0407-G5500-11
  Local PID:           49564
  Peer hostname:       5A301-0407-G5500-12 ([[62864,0],8])
  Source IP of socket: 10.3.29.53
  Known IPs of peer:   a03:190d::a03:190d::, a03:1871::a03:1871::, a03:1d53::a03:1d53::, a03:1d0d::a03:1d0d::
--------------------------------------------------------------------------

I investigated, and it seems node 11 cannot communicate with node 12.
One thing that bugs me is that I don't know what a03:190d::a03:190d::, a03:1871::a03:1871::, a03:1d53::a03:1d53::, a03:1d0d::a03:1d0d:: are (yes, they look like IPv6), since:

  1. similar error reports on the internet usually list IPv4 addresses here
  2. these IPv6 addresses do not appear anywhere in the output of ip addr
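
One possible reading of those strings, offered as an unconfirmed guess rather than anything stated in this thread: none of them is a well-formed IPv6 address (each contains `::` twice), and the two leading hex groups of each one happen to decode to four bytes in the 10.3.x.x range. The sketch below only does that arithmetic; the helper name `hextets_to_ipv4` is hypothetical, and whether Open MPI really printed IPv4 addresses this way is an assumption.

```python
import ipaddress

def hextets_to_ipv4(printed):
    """Read the first two hex groups of a printed address as four IPv4 bytes.

    Assumption: the printed string may be an IPv4 address rendered as hex groups.
    """
    first = printed.split("::")[0]                      # e.g. "a03:1d53"
    hi, lo = (int(part, 16) for part in first.split(":"))
    return str(ipaddress.IPv4Address((hi << 16) | lo))

# The "Known IPs of peer" strings copied from the error message above.
known = [
    "a03:190d::a03:190d::",
    "a03:1871::a03:1871::",
    "a03:1d53::a03:1d53::",
    "a03:1d0d::a03:1d0d::",
]
decoded = [hextets_to_ipv4(k) for k in known]
print(decoded)                  # the four candidate IPv4 addresses
print("10.3.29.53" in decoded)  # is the socket's source IP among them?
```

If this reading holds, the candidates can be compared directly against the IPv4 addresses that ip addr reports on node 12.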

I investigated further: 10.3.29.53 is node 12's 25G RoCEv2 network interface.
Also, 10.3.29.82 (the address in the verbose log above) is node 11's 25G RoCEv2 control network interface.

Another thing that confuses me: the log says node 12 is attempting to connect to node 11's RoCEv2 control network address, but the error message seems to say the opposite, that node 11 is trying to connect to node 12 from an unexpected IP.

I have tried setting OMPI_MCA_btl_tcp_if_include to several values. The error disappeared only once, and in that case the process hung afterwards.
I am at a loss as to how to proceed.
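
For reference, a minimal sketch of how OMPI_MCA_btl_tcp_if_include could be set for a launch, assuming a comma-separated list of interface names or CIDR subnets as the value. The subnet 10.3.29.0/24 is an assumption (the thread never gives a netmask), and the helper name `with_tcp_if_include` and the program `./my_mpi_app` are hypothetical.

```python
import os

def with_tcp_if_include(selectors, base_env=None):
    """Return a copy of the environment with btl_tcp_if_include set.

    `selectors` is a list of interface names or CIDR subnets.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMPI_MCA_btl_tcp_if_include"] = ",".join(selectors)
    return env

# Assumed subnet; replace with the network both nodes can actually reach.
env = with_tcp_if_include(["10.3.29.0/24"], base_env={})
print(env["OMPI_MCA_btl_tcp_if_include"])

# Hypothetical launch, using the real environment as the base:
# import subprocess
# subprocess.run(["srun", "-N", "2", "./my_mpi_app"],
#                env=with_tcp_if_include(["10.3.29.0/24"]))
```

Selecting by subnet rather than by interface name may help when the same network is attached to differently named interfaces on the two nodes.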

@bosilca
Member

bosilca commented Nov 12, 2024

As the error message tries to explain, peer 5A301-0407-G5500-12 has only published a set of IPv6 addresses, but it is trying to initiate a connection via IPv4. OMPI drops it, as the source address is not part of the known list of addresses.

It is definitely node 12 trying to connect to node 11. Looking at the code, this message is generated in the accept path, so on node 11.

Are your interfaces correctly set up on both nodes? Are the routing tables correct? You can try disabling IPv6 and then limiting the traffic to the usual network (I am not sure what you mean by the control network in your setup).
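
One way to start the interface check on each node is to look up which interface owns a given address in the JSON output of `ip -j addr show`. The sketch below is only an illustration: the sample data, the interface name `ens1f0`, and the prefix length are invented, and the helper name `find_interface` is hypothetical. Only the address 10.3.29.53 comes from this thread.

```python
import json

def find_interface(ip_json_text, address):
    """Return (ifname, family, prefixlen) for the interface owning `address`, or None."""
    for link in json.loads(ip_json_text):
        for info in link.get("addr_info", []):
            if info.get("local") == address:
                return link["ifname"], info["family"], info["prefixlen"]
    return None

# Hypothetical, trimmed sample of `ip -j addr show` output (names and prefixes invented).
SAMPLE = """[
  {"ifname": "lo",
   "addr_info": [{"family": "inet", "local": "127.0.0.1", "prefixlen": 8}]},
  {"ifname": "ens1f0",
   "addr_info": [{"family": "inet", "local": "10.3.29.53", "prefixlen": 24}]}
]"""

print(find_interface(SAMPLE, "10.3.29.53"))

# On a real node the input could come from:
# import subprocess
# text = subprocess.run(["ip", "-j", "addr", "show"],
#                       capture_output=True, text=True, check=True).stdout
```

Running the same lookup on both nodes for the addresses in the log shows whether the two ends agree on which network each address belongs to.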
