I have found a couple of issues that seem similar to this one, but I can't tell whether they have been solved or how they apply to my situation.
I am running Slurm jobs with srun using Open MPI. A job on a single node completes (with some warnings), but when I run it on two nodes I get:
[5A301-0407-G5500-12:89116] btl: tcp: attempting to connect() to [[62864,0],0] address 10.3.29.82 on port 1031
--------------------------------------------------------------------------
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected. This is highly unusual.
The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).
Local host: 5A301-0407-G5500-11
Local PID: 49564
Peer hostname: 5A301-0407-G5500-12 ([[62864,0],8])
Source IP of socket: 10.3.29.53
Known IPs of peer: a03:190d::a03:190d::, a03:1871::a03:1871::, a03:1d53::a03:1d53::, a03:1d0d::a03:1d0d::
--------------------------------------------------------------------------
I investigated, and it seems node 11 cannot communicate with node 12.
One thing that bugs me is that I don't know what a03:190d::a03:190d::, a03:1871::a03:1871::, a03:1d53::a03:1d53::, a03:1d0d::a03:1d0d:: are (yes, they look like IPv6), since:
similar errors reported online usually list alternative IPv4 addresses here
these IPv6 addresses do not appear anywhere in the output of ip addr
I investigated further and 10.3.29.53 is Node 12's 25G RoCEv2 Network interface
Also 10.3.29.82 (the one in the verbose log above) is Node 11's 25G RoCEv2 Control Network interface
Another thing that confuses me: the log says node 12 is attempting to connect to node 11's RoCEv2 control network, but the error message seems to say the opposite, that node 11 is trying to connect to node 12 on an unexpected IP.
I have tried setting OMPI_MCA_btl_tcp_if_include to several values. The error disappeared only once, and in that case the process got stuck afterwards.
I am at a loss as to how to proceed.
As the error message tries to explain, peer 5A301-0407-G5500-12 has only published a set of ipv6 addresses, but it is trying to initiate a connection via ipv4. OMPI drops it, as the source address is not part of the known list of addresses.
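To make the check described above concrete, here is a minimal Python sketch. It is not Open MPI's code, only an illustration of "drop the connection when the source address is not in the peer's published list". The values are copied from the error message. The helper names are hypothetical, and the idea of reading the first 32 bits of each printed address as an IPv4 address is purely an assumption that nothing in this thread confirms.

```python
import ipaddress

# Values copied from the error message above.
SOURCE_IP = "10.3.29.53"
# Leading part of each "Known IPs of peer" entry. The log prints them in a
# doubled form (e.g. "a03:190d::a03:190d::") that is not valid IPv6 text,
# so only the first half of each entry is used here.
KNOWN_PREFIXES = ["a03:190d::", "a03:1871::", "a03:1d53::", "a03:1d0d::"]


def top32_as_ipv4(text):
    """Assumption: reinterpret the top 32 bits of an IPv6 value as IPv4."""
    v6 = ipaddress.IPv6Address(text)
    return ipaddress.IPv4Address(int(v6) >> 96)


def source_is_known(source, known):
    """Illustrative accept-side check: is the source in the published list?"""
    src = ipaddress.ip_address(source)
    return any(src == ipaddress.ip_address(k) for k in known)


decoded = [str(top32_as_ipv4(k)) for k in KNOWN_PREFIXES]
print(decoded)
print(source_is_known(SOURCE_IP, KNOWN_PREFIXES))
print(SOURCE_IP in decoded)
```

Under either reading the source address 10.3.29.53 does not appear in the published list, which would be consistent with the connection being dropped.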
It is definitely node 12 trying to connect to node 11. Looking at the code, this message is generated in the accept path, so on node 11.
Are your interfaces correctly set up on both nodes? Are the routing tables correct? You can try disabling IPv6 and then limiting the traffic to the usual network (I am not sure what you mean by the control network in your setup).
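As a rough sketch of that suggestion, the Python helper below builds an environment that restricts the TCP BTL to a single IPv4 subnet and asks it to skip IPv6. Several things here are assumptions: the helper name `tcp_only_env` is made up for illustration, the subnet `10.3.29.0/24` is a guess at the netmask (only the individual addresses appear in this thread), and `btl_tcp_disable_family` should be confirmed for your build with `ompi_info --param btl tcp --level 9` before relying on it.

```python
import ipaddress
import os


def tcp_only_env(subnet, base_env=None):
    """Return a copy of the environment that pins the TCP BTL to one IPv4 subnet."""
    # Validate the CIDR string early so a typo fails here, not inside the job.
    net = ipaddress.ip_network(subnet, strict=True)
    if net.version != 4:
        raise ValueError("expected an IPv4 subnet, got %s" % subnet)

    env = dict(os.environ if base_env is None else base_env)
    # btl_tcp_if_include accepts interface names or CIDR subnets.
    env["OMPI_MCA_btl_tcp_if_include"] = str(net)
    # Assumption: this build exposes btl_tcp_disable_family (6 = skip IPv6).
    env["OMPI_MCA_btl_tcp_disable_family"] = "6"
    return env


# Hypothetical launch, not executed here; "./my_app" is a placeholder:
#   import subprocess
#   subprocess.run(["srun", "-N", "2", "./my_app"],
#                  env=tcp_only_env("10.3.29.0/24"))
```

Validating the subnet in Python first is only a convenience. The same two variables can be exported directly in the shell before calling srun.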