
Retrieve the modex of all jobs during accept/connect #11736

Merged
merged 1 commit into open-mpi:main from topic/fix_multi_spawn
Aug 17, 2023

Conversation

bosilca
Member

@bosilca bosilca commented Jun 6, 2023

The original code was merging the local modex with the modex of the local processes on the first jobid. This led to incorrect and mismatched information among processes when joining processes from multiple jobids (such as when the second spawn is merged).

This patch iterates over all the jobids on the list of "to connect" processes and adds their information to the local modex.

Fixes #11724.
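
For illustration, here is a minimal C sketch of the idea described above: walk the list of processes being connected, de-duplicate their jobids, and merge the modex of every job rather than only the first one encountered. The proc_entry_t type and the merge_job_modex() helper are hypothetical placeholders, not the actual Open MPI dpm code.

/* Sketch only; names are hypothetical, not Open MPI internals. */
#include <stddef.h>
#include <stdint.h>

typedef uint32_t jobid_t;

typedef struct proc_entry {
    jobid_t jobid;               /* job the remote process belongs to */
    struct proc_entry *next;
} proc_entry_t;

/* Hypothetical helper: fetch one job's modex and fold it into the local
 * store; stubbed out so the sketch compiles. */
int merge_job_modex(jobid_t jobid) { (void)jobid; return 0; }

int merge_all_job_modexes(proc_entry_t *to_connect)
{
    jobid_t seen[64];            /* assume a small, bounded number of jobs */
    size_t nseen = 0;

    for (proc_entry_t *p = to_connect; NULL != p; p = p->next) {
        int already = 0;
        for (size_t i = 0; i < nseen; i++) {
            if (seen[i] == p->jobid) { already = 1; break; }
        }
        if (already) {
            continue;            /* this job's modex was already merged */
        }
        if (nseen == sizeof(seen) / sizeof(seen[0])) {
            return -1;           /* sketch-only limit exceeded */
        }
        seen[nseen++] = p->jobid;

        /* Before the fix, only the first jobid on the list got this far. */
        int rc = merge_job_modex(p->jobid);
        if (0 != rc) {
            return rc;
        }
    }
    return 0;
}

The real patch naturally operates on Open MPI's internal process and PMIx structures; the sketch only captures the control-flow change, namely iterating per jobid with de-duplication.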

@bosilca bosilca added the bug label Jun 6, 2023
@bosilca bosilca added this to the v5.0.0 milestone Jun 6, 2023
@bosilca bosilca self-assigned this Jun 6, 2023
@bosilca bosilca changed the title from "Ask for the modex of all jobs connected." to "Retrieve the modex of all jobs during accept/connect" Jun 6, 2023
@rhc54
Contributor

rhc54 commented Jun 6, 2023

FWIW: I tried this patch with the reproducer provided here by the original reporter, but it generated a bunch of failures for me:

$ mpirun -n 4 ./spawn
1, after first barrier
1: Second barrier: MPI_SUCCESS: no errors
0, after first barrier
0: Second barrier: MPI_SUCCESS: no errors
3, after first barrier
3: Second barrier: MPI_SUCCESS: no errors
2, after first barrier
2: Second barrier: MPI_SUCCESS: no errors
4: spawned
3, after second barrier
2, after second barrier
1, after second barrier
0, after second barrier
4, after second barrier
2: Last barrier: MPI_SUCCESS: no errors
1: Last barrier: MPI_SUCCESS: no errors
3: Last barrier: MPI_SUCCESS: no errors
0: Last barrier: MPI_SUCCESS: no errors
4: Last barrier: MPI_SUCCESS: no errors
5: spawned
3, after last barrier
1, after last barrier
2, after last barrier
0, after last barrier
5, after last barrier
4, after last barrier
[Ralphs-iMac-2][[45637,1],2][btl_tcp_endpoint.c:667:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier: got [[45637,1],0] expected [[45637,3],0]
[Ralphs-iMac-2][[45637,1],1][btl_tcp_endpoint.c:667:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier: got [[45637,1],0] expected [[45637,3],0]
[Ralphs-iMac-2][[45637,3],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[Ralphs-iMac-2][[45637,2],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[Ralphs-iMac-2.local:09882] dpm_disconnect_init: error -12 in isend to process 1
[Ralphs-iMac-2.local:09881] dpm_disconnect_init: error -12 in isend to process 1
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[45637,3],0]) is on host: Ralphs-iMac-2
  Process 2 ([[45637,1],1]) is on host: unknown
  BTLs attempted: tcp self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[Ralphs-iMac-2][[45637,2],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[Ralphs-iMac-2][[45637,3],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[Ralphs-iMac-2.local:09882] dpm_disconnect_init: error -12 in isend to process 1
[Ralphs-iMac-2.local:09882] Error in comm_disconnect_waitall
[Ralphs-iMac-2:09882] *** Process received signal ***
[Ralphs-iMac-2:09882] Signal: Segmentation fault: 11 (11)
[Ralphs-iMac-2:09882] Signal code: Address not mapped (1)
[Ralphs-iMac-2:09882] Failing at address: 0x10
[Ralphs-iMac-2:09882] *** End of error message ***
[Ralphs-iMac-2.local:09881] dpm_disconnect_init: error -12 in isend to process 1
[Ralphs-iMac-2][[45637,2],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[Ralphs-iMac-2.local:09881] dpm_disconnect_init: error -12 in isend to process 1
[Ralphs-iMac-2.local:09881] Error in comm_disconnect_waitall
[Ralphs-iMac-2:09881] *** Process received signal ***
[Ralphs-iMac-2:09881] Signal: Segmentation fault: 11 (11)
[Ralphs-iMac-2:09881] Signal code: Address not mapped (1)
[Ralphs-iMac-2:09881] Failing at address: 0x10
[Ralphs-iMac-2:09881] *** End of error message ***
5 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: Ralphs-iMac-2
  Local PID:  9879
  Peer host:  Ralphs-iMac-2
--------------------------------------------------------------------------
1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
1 more process has sent help message help-mpi-btl-tcp.txt / peer hung up

@bosilca
Member Author

bosilca commented Jun 7, 2023

@rhc54 based on the output ('after last barrier'), all the processes (including the newly spawned ones) have reached MPI_Finalize. The errors are related to 1) your MCA environment limiting the BTLs to tcp,self and 2) the TCP BTL not being very Mac friendly. Adding sm to the list of allowed BTLs should fix this issue.

@rhc54
Contributor

rhc54 commented Jun 7, 2023

Hmmm...well, I don't have any MCA params set at all, so I don't think that can be the case. Just for grins, I tried mpirun -n 4 --mca btl tcp,sm,self ./spawn, but that failed with the same errors.

So I went and tried a simple single-spawn test code and it worked fine:

$ mpirun -n 4 ./simple_spawn
[prterun-Ralphs-iMac-2-13219@1:0 pid 13220] starting up on node Ralphs-iMac-2.local!
[prterun-Ralphs-iMac-2-13219@1:1 pid 13221] starting up on node Ralphs-iMac-2.local!
[prterun-Ralphs-iMac-2-13219@1:2 pid 13222] starting up on node Ralphs-iMac-2.local!
[prterun-Ralphs-iMac-2-13219@1:3 pid 13223] starting up on node Ralphs-iMac-2.local!
0 completed MPI_Init
Parent [pid 13220] about to spawn!
2 completed MPI_Init
Parent [pid 13222] about to spawn!
1 completed MPI_Init
Parent [pid 13221] about to spawn!
3 completed MPI_Init
Parent [pid 13223] about to spawn!
[prterun-Ralphs-iMac-2-13219@2:0 pid 13224] starting up on node Ralphs-iMac-2.local!
[prterun-Ralphs-iMac-2-13219@2:1 pid 13225] starting up on node Ralphs-iMac-2.local!
[prterun-Ralphs-iMac-2-13219@2:2 pid 13226] starting up on node Ralphs-iMac-2.local!
Parent done with spawn
Parent done with spawn
Parent done with spawn
Parent done with spawn
Parent sending message to child
2 completed MPI_Init
Hello from the child 2 of 3 on host Ralphs-iMac-2.local pid 13226
1 completed MPI_Init
Hello from the child 1 of 3 on host Ralphs-iMac-2.local pid 13225
0 completed MPI_Init
Hello from the child 0 of 3 on host Ralphs-iMac-2.local pid 13224
Child 0 received msg: 38
Parent disconnected
Child 2 disconnected
Parent disconnected
Parent disconnected
Parent disconnected
Child 1 disconnected
Child 0 disconnected
13223: exiting
13221: exiting
13226: exiting
13222: exiting
13225: exiting
13220: exiting
13224: exiting
$

So it doesn't appear to be BTL related, but rather still something about the multiple spawn case. Anyway, just letting you know in case it is helpful.

@bosilca
Member Author

bosilca commented Jun 7, 2023

Here is the important snippet from your output:

3, after last barrier
1, after last barrier
2, after last barrier
0, after last barrier
5, after last barrier
4, after last barrier

This shows that all the processes (4+1+1) completed the last MPI_Barrier and went into MPI_Finalize. The fact that all processes were able to successfully complete a barrier tells us that the communication capabilities of the entire job are functional. Thus, you can safely ignore the errors; they are not related to this PR.
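
For reference, a test with the shape suggested by this output (4 initial processes, then two single-process spawns merged one at a time) could look roughly like the minimal sketch below. This is an assumption about the reproducer from #11724, not the reporter's actual code, and error handling is omitted.

/* Editorial sketch of a double-spawn/merge test; not the actual reproducer. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, inter, merged;
    int rank, remote = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* Initial job: 4 processes started by mpirun. */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        printf("%d, after first barrier\n", rank);

        /* First spawn: one process from a new jobid joins the group. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
        MPI_Intercomm_merge(inter, 0, &merged);
    } else {
        /* Spawned child: the remote size of the parent intercomm tells us
         * whether we are the first child (4 parents) or the second (5). */
        MPI_Comm_remote_size(parent, &remote);
        MPI_Intercomm_merge(parent, 1, &merged);
        MPI_Comm_rank(merged, &rank);
        printf("%d: spawned\n", rank);
    }

    if (5 != remote) {
        /* Everyone except the second child: barrier on the 5-process merged
         * comm, then spawn one more process from yet another jobid. */
        MPI_Barrier(merged);
        MPI_Comm_rank(merged, &rank);
        printf("%d, after second barrier\n", rank);

        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                       merged, &inter, MPI_ERRCODES_IGNORE);
        MPI_Intercomm_merge(inter, 0, &merged);  /* old handle leaked for brevity */
    }

    /* All 6 processes, spanning three jobids, meet here. */
    MPI_Barrier(merged);
    MPI_Comm_rank(merged, &rank);
    printf("%d, after last barrier\n", rank);

    MPI_Finalize();
    return 0;
}

After the second merge the communicator spans three jobids, which is exactly the case where the pre-patch code only pulled in the modex of the first jobid.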

@Robyroc

Robyroc commented Jun 8, 2023

I cherry-picked the commit into the v5.0.x branch and can confirm that it also fixes the issues on my platform. I could not make it work on the main branch due to some problems compiling PMI, but those should be outside the scope of this PR.

The original code was merging the local modex with the modex of the
local processes on the first jobid. This led to incorrect and
mismatched information among processes when joining processes from
multiple jobids (such as when the second spawn is merged).

This patch iterates over all the jobids on the list of "to connect"
processes and adds their information to the local modex.

Fixes open-mpi#11724.

Signed-off-by: George Bosilca <[email protected]>
@bosilca bosilca force-pushed the topic/fix_multi_spawn branch from 2bb9f0e to ba0bce4 on August 15, 2023 16:35
@janjust
Contributor

janjust commented Aug 17, 2023

@bosilca @abouteiller As soon as we get a review and a v5.0 cherry-pick, we'll be happy to get this in.

@bosilca
Member Author

bosilca commented Aug 17, 2023

We are having some internal discussions on the most efficient way to gather the modex across multiple jobs. I'll get back to this later tomorrow.

@bosilca
Member Author

bosilca commented Aug 17, 2023

We decided to move forward with this as is; if we notice a performance impact at scale, we can reassess.

@bosilca bosilca merged commit 8e656d9 into open-mpi:main Aug 17, 2023
@bosilca bosilca deleted the topic/fix_multi_spawn branch August 17, 2023 15:13
@janjust
Contributor

janjust commented Aug 17, 2023

OK, great - please open up the v5.0 cherry-pick.

Development

Successfully merging this pull request may close these issues.

MPI_Intercomm_merge not working after second ULFM shrink.