Retrieve the modex of all jobs during accept/connect #11736
Conversation
FWIW: I tried this patch with the reproducer provided here by the original reporter, but it generated a bunch of failures for me:
$ mpirun -n 4 ./spawn
1, after first barrier
1: Second barrier: MPI_SUCCESS: no errors
0, after first barrier
0: Second barrier: MPI_SUCCESS: no errors
3, after first barrier
3: Second barrier: MPI_SUCCESS: no errors
2, after first barrier
2: Second barrier: MPI_SUCCESS: no errors
4: spawned
3, after second barrier
2, after second barrier
1, after second barrier
0, after second barrier
4, after second barrier
2: Last barrier: MPI_SUCCESS: no errors
1: Last barrier: MPI_SUCCESS: no errors
3: Last barrier: MPI_SUCCESS: no errors
0: Last barrier: MPI_SUCCESS: no errors
4: Last barrier: MPI_SUCCESS: no errors
5: spawned
3, after last barrier
1, after last barrier
2, after last barrier
0, after last barrier
5, after last barrier
4, after last barrier
[Ralphs-iMac-2][[45637,1],2][btl_tcp_endpoint.c:667:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier: got [[45637,1],0] expected [[45637,3],0]
[Ralphs-iMac-2][[45637,1],1][btl_tcp_endpoint.c:667:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier: got [[45637,1],0] expected [[45637,3],0]
[Ralphs-iMac-2][[45637,3],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[Ralphs-iMac-2][[45637,2],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[Ralphs-iMac-2.local:09882] dpm_disconnect_init: error -12 in isend to process 1
[Ralphs-iMac-2.local:09881] dpm_disconnect_init: error -12 in isend to process 1
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[45637,3],0]) is on host: Ralphs-iMac-2
Process 2 ([[45637,1],1]) is on host: unknown
BTLs attempted: tcp self
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[Ralphs-iMac-2][[45637,2],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[Ralphs-iMac-2][[45637,3],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[Ralphs-iMac-2.local:09882] dpm_disconnect_init: error -12 in isend to process 1
[Ralphs-iMac-2.local:09882] Error in comm_disconnect_waitall
[Ralphs-iMac-2:09882] *** Process received signal ***
[Ralphs-iMac-2:09882] Signal: Segmentation fault: 11 (11)
[Ralphs-iMac-2:09882] Signal code: Address not mapped (1)
[Ralphs-iMac-2:09882] Failing at address: 0x10
[Ralphs-iMac-2:09882] *** End of error message ***
[Ralphs-iMac-2.local:09881] dpm_disconnect_init: error -12 in isend to process 1
[Ralphs-iMac-2][[45637,2],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[Ralphs-iMac-2.local:09881] dpm_disconnect_init: error -12 in isend to process 1
[Ralphs-iMac-2.local:09881] Error in comm_disconnect_waitall
[Ralphs-iMac-2:09881] *** Process received signal ***
[Ralphs-iMac-2:09881] Signal: Segmentation fault: 11 (11)
[Ralphs-iMac-2:09881] Signal code: Address not mapped (1)
[Ralphs-iMac-2:09881] Failing at address: 0x10
[Ralphs-iMac-2:09881] *** End of error message ***
5 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: Ralphs-iMac-2
Local PID: 9879
Peer host: Ralphs-iMac-2
--------------------------------------------------------------------------
1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
1 more process has sent help message help-mpi-btl-tcp.txt / peer hung up
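For reference, a minimal double-spawn reproducer matching this output could look roughly like the sketch below. This is an assumption-based reconstruction (the program structure, argument passing, and print statements are inferred from the output above), not the original reporter's code from #11724: four initial ranks spawn one extra copy of the binary twice, merging the new process and doing a barrier after each spawn.

/* Hedged sketch of a double-spawn reproducer; structure inferred from the
 * output above, not the original reporter's code. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    MPI_Comm comm, parent, inter, merged;
    int rank, remaining = 2;              /* total number of spawns to do */
    char arg[16];
    char *spawn_argv[2] = { arg, NULL };

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* Original job: start from a copy of MPI_COMM_WORLD. */
        MPI_Comm_dup(MPI_COMM_WORLD, &comm);
    } else {
        /* Spawned child: merge into the spawning group, then learn from its
         * command line how many further spawns it must participate in. */
        MPI_Intercomm_merge(parent, 1, &comm);
        MPI_Comm_disconnect(&parent);
        remaining = atoi(argv[1]);
        MPI_Comm_rank(comm, &rank);
        printf("%d: spawned\n", rank);
    }

    while (remaining-- > 0) {
        MPI_Comm_rank(comm, &rank);
        MPI_Barrier(comm);
        printf("%d, after barrier (%d spawns left)\n", rank, remaining);

        /* Every spawn creates a new jobid; merging pulls it into comm. */
        snprintf(arg, sizeof(arg), "%d", remaining);
        MPI_Comm_spawn(argv[0], spawn_argv, 1, MPI_INFO_NULL,
                       0, comm, &inter, MPI_ERRCODES_IGNORE);
        MPI_Intercomm_merge(inter, 0, &merged);
        MPI_Comm_disconnect(&inter);
        MPI_Comm_free(&comm);
        comm = merged;
    }

    MPI_Comm_rank(comm, &rank);
    MPI_Barrier(comm);
    printf("%d, after last barrier\n", rank);
    MPI_Comm_free(&comm);
    MPI_Finalize();
    return 0;
}

The second spawn introduces a third jobid ([[45637,3],x] in the output above), which is exactly the case where the modex information went wrong.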
@rhc54 based on the output ('after last barrier'), all the processes (including the newly spawned ones) have reached the last barrier.
Hmmm...well, I don't have any MCA params set at all, so I don't think that can be the case. Just for grins, I went and tried a simple single-spawn test code and it worked fine:
$ mpirun -n 4 ./simple_spawn
[prterun-Ralphs-iMac-2-13219@1:0 pid 13220] starting up on node Ralphs-iMac-2.local!
[prterun-Ralphs-iMac-2-13219@1:1 pid 13221] starting up on node Ralphs-iMac-2.local!
[prterun-Ralphs-iMac-2-13219@1:2 pid 13222] starting up on node Ralphs-iMac-2.local!
[prterun-Ralphs-iMac-2-13219@1:3 pid 13223] starting up on node Ralphs-iMac-2.local!
0 completed MPI_Init
Parent [pid 13220] about to spawn!
2 completed MPI_Init
Parent [pid 13222] about to spawn!
1 completed MPI_Init
Parent [pid 13221] about to spawn!
3 completed MPI_Init
Parent [pid 13223] about to spawn!
[prterun-Ralphs-iMac-2-13219@2:0 pid 13224] starting up on node Ralphs-iMac-2.local!
[prterun-Ralphs-iMac-2-13219@2:1 pid 13225] starting up on node Ralphs-iMac-2.local!
[prterun-Ralphs-iMac-2-13219@2:2 pid 13226] starting up on node Ralphs-iMac-2.local!
Parent done with spawn
Parent done with spawn
Parent done with spawn
Parent done with spawn
Parent sending message to child
2 completed MPI_Init
Hello from the child 2 of 3 on host Ralphs-iMac-2.local pid 13226
1 completed MPI_Init
Hello from the child 1 of 3 on host Ralphs-iMac-2.local pid 13225
0 completed MPI_Init
Hello from the child 0 of 3 on host Ralphs-iMac-2.local pid 13224
Child 0 received msg: 38
Parent disconnected
Child 2 disconnected
Parent disconnected
Parent disconnected
Parent disconnected
Child 1 disconnected
Child 0 disconnected
13223: exiting
13221: exiting
13226: exiting
13222: exiting
13225: exiting
13220: exiting
13224: exiting
$
So it doesn't appear to be BTL related, but rather still something about the multiple spawn case. Anyway, just letting you know in case it is helpful.
The important snippet from your output shows that all the processes (4+1+1) completed the last barrier.
I cherry-picked the commit onto the v5.0.x branch and can confirm that it also fixes the issues on my platform. I could not make it work on the main branch due to some problems compiling PMI, but they should be outside the scope of this PR.
The original code was merging the local modex with the modex of the local processes on the first jobid. This led to incorrect, and mismatched, information among processes when joining processes from multiple jobids (such as when the second spawn is merged). This patch iterates over all the jobids on the list of "to connect" processes and adds their information to the local modex. Fixes open-mpi#11724. Signed-off-by: George Bosilca <[email protected]>
@bosilca @abouteiller as soon as we get a review, and get a v5.0 PR, we'll be happy to get this in.
We are having some internal discussions on the most efficient way to gather the modex across multiple jobs. I'll get back to this later tomorrow.
We decided to move forward with this as is, and if we notice a performance impact at scale we can reassess.
ok great - please open up the v5.0 cherry-pick.
The original code was merging the local modex with the modex of the local processes on the first jobid. This led to incorrect, and mismatched, information among processes when joining processes from multiple jobids (such as when the second spawn is merged).
This patch iterates over all the jobids on the list of "to connect" processes and adds their information to the local modex.
Fixes #11724.
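To illustrate the shape of the fix described above, here is a minimal sketch. The type and function names (proc_name_t, add_job_modex_to_local_cache) are hypothetical stand-ins, not Open MPI's actual dpm/modex internals; the point is only the loop over every distinct jobid in the "to connect" list rather than just the first one.

/* Hypothetical sketch only: names and types below are illustrative stand-ins,
 * not Open MPI's actual internal structures or APIs. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t jobid;   /* which job (original or spawned) the process belongs to */
    uint32_t vpid;    /* rank within that job */
} proc_name_t;

/* Assumed helper: fetch one job's modex and merge it into the local cache. */
extern int add_job_modex_to_local_cache(uint32_t jobid);

/* Old behavior: only the first jobid in the "to connect" list was merged,
 * so processes from a second spawned job ended up with stale endpoint info.
 * New behavior: merge the modex of every distinct jobid found in the list. */
static int merge_modex_for_all_jobs(const proc_name_t *procs, size_t nprocs)
{
    for (size_t i = 0; i < nprocs; i++) {
        bool seen = false;
        for (size_t j = 0; j < i; j++) {   /* skip jobids already handled */
            if (procs[j].jobid == procs[i].jobid) {
                seen = true;
                break;
            }
        }
        if (!seen) {
            int rc = add_job_modex_to_local_cache(procs[i].jobid);
            if (0 != rc) {
                return rc;
            }
        }
    }
    return 0;
}

Deduplicating jobids keeps the number of modex merges proportional to the number of jobs involved rather than the number of processes, which is the efficiency concern raised in the discussion above.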