
Retrieve the modex of all jobs during accept/connect #11736

Merged
merged 1 commit into open-mpi:main from topic/fix_multi_spawn
Aug 17, 2023

Conversation

bosilca
Member

@bosilca bosilca commented Jun 6, 2023

The original code was merging the local modex with the modex of the local processes on the first jobid. This led to incorrect and mismatched information among processes when joining processes from multiple jobids (such as when the second spawn is merged).

This patch iterates over all the jobids on the list of "to connect" processes and adds their information to the local modex.

Fixes #11724.
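
For illustration, here is a minimal C sketch of the idea described above: walk the list of processes being connected, de-duplicate their jobids, and merge the modex of every job rather than only the first one encountered. The proc_entry_t type and the merge_job_modex() helper are hypothetical placeholders, not the actual Open MPI dpm code.

/* Sketch only; names are hypothetical, not Open MPI internals. */
#include <stddef.h>
#include <stdint.h>

typedef uint32_t jobid_t;

typedef struct proc_entry {
    jobid_t jobid;               /* job the remote process belongs to */
    struct proc_entry *next;
} proc_entry_t;

/* Hypothetical helper: fetch one job's modex and fold it into the local
 * store; stubbed out so the sketch compiles. */
int merge_job_modex(jobid_t jobid) { (void)jobid; return 0; }

int merge_all_job_modexes(proc_entry_t *to_connect)
{
    jobid_t seen[64];            /* assume a small, bounded number of jobs */
    size_t nseen = 0;

    for (proc_entry_t *p = to_connect; NULL != p; p = p->next) {
        int already = 0;
        for (size_t i = 0; i < nseen; i++) {
            if (seen[i] == p->jobid) { already = 1; break; }
        }
        if (already) {
            continue;            /* this job's modex was already merged */
        }
        if (nseen == sizeof(seen) / sizeof(seen[0])) {
            return -1;           /* sketch-only limit exceeded */
        }
        seen[nseen++] = p->jobid;

        /* Before the fix, only the first jobid on the list got this far. */
        int rc = merge_job_modex(p->jobid);
        if (0 != rc) {
            return rc;
        }
    }
    return 0;
}

The real patch naturally operates on Open MPI's internal process and PMIx structures; the sketch only captures the control-flow change, namely iterating per jobid with de-duplication.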

@bosilca bosilca added the bug label Jun 6, 2023
@bosilca bosilca added this to the v5.0.0 milestone Jun 6, 2023
@bosilca bosilca self-assigned this Jun 6, 2023
@bosilca bosilca changed the title from "Ask for the modex of all jobs connected." to "Retrieve the modex of all jobs during accept/connect" Jun 6, 2023
@rhc54
Contributor

rhc54 commented Jun 6, 2023

FWIW: I tried this patch with the reproducer provided here by the original reporter, but it generated a bunch of failures for me:

$ mpirun -n 4 ./spawn
1, after first barrier
1: Second barrier: MPI_SUCCESS: no errors
0, after first barrier
0: Second barrier: MPI_SUCCESS: no errors
3, after first barrier
3: Second barrier: MPI_SUCCESS: no errors
2, after first barrier
2: Second barrier: MPI_SUCCESS: no errors
4: spawned
3, after second barrier
2, after second barrier
1, after second barrier
0, after second barrier
4, after second barrier
2: Last barrier: MPI_SUCCESS: no errors
1: Last barrier: MPI_SUCCESS: no errors
3: Last barrier: MPI_SUCCESS: no errors
0: Last barrier: MPI_SUCCESS: no errors
4: Last barrier: MPI_SUCCESS: no errors
5: spawned
3, after last barrier
1, after last barrier
2, after last barrier
0, after last barrier
5, after last barrier
4, after last barrier
[Ralphs-iMac-2][[45637,1],2][btl_tcp_endpoint.c:667:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier: got [[45637,1],0] expected [[45637,3],0]
[Ralphs-iMac-2][[45637,1],1][btl_tcp_endpoint.c:667:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier: got [[45637,1],0] expected [[45637,3],0]
[Ralphs-iMac-2][[45637,3],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[Ralphs-iMac-2][[45637,2],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[Ralphs-iMac-2.local:09882] dpm_disconnect_init: error -12 in isend to process 1
[Ralphs-iMac-2.local:09881] dpm_disconnect_init: error -12 in isend to process 1
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[45637,3],0]) is on host: Ralphs-iMac-2
  Process 2 ([[45637,1],1]) is on host: unknown
  BTLs attempted: tcp self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[Ralphs-iMac-2][[45637,2],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[Ralphs-iMac-2][[45637,3],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[Ralphs-iMac-2.local:09882] dpm_disconnect_init: error -12 in isend to process 1
[Ralphs-iMac-2.local:09882] Error in comm_disconnect_waitall
[Ralphs-iMac-2:09882] *** Process received signal ***
[Ralphs-iMac-2:09882] Signal: Segmentation fault: 11 (11)
[Ralphs-iMac-2:09882] Signal code: Address not mapped (1)
[Ralphs-iMac-2:09882] Failing at address: 0x10
[Ralphs-iMac-2:09882] *** End of error message ***
[Ralphs-iMac-2.local:09881] dpm_disconnect_init: error -12 in isend to process 1
[Ralphs-iMac-2][[45637,2],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[Ralphs-iMac-2.local:09881] dpm_disconnect_init: error -12 in isend to process 1
[Ralphs-iMac-2.local:09881] Error in comm_disconnect_waitall
[Ralphs-iMac-2:09881] *** Process received signal ***
[Ralphs-iMac-2:09881] Signal: Segmentation fault: 11 (11)
[Ralphs-iMac-2:09881] Signal code: Address not mapped (1)
[Ralphs-iMac-2:09881] Failing at address: 0x10
[Ralphs-iMac-2:09881] *** End of error message ***
5 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: Ralphs-iMac-2
  Local PID:  9879
  Peer host:  Ralphs-iMac-2
--------------------------------------------------------------------------
1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
1 more process has sent help message help-mpi-btl-tcp.txt / peer hung up

@bosilca
Member Author

bosilca commented Jun 7, 2023

@rhc54 based on the output ('after last barrier'), all the processes (including the newly spawned ones) have reached MPI_Finalize. The errors are related to 1) your MCA environment limiting the BTLs to tcp,self and 2) the TCP BTL not being very Mac friendly. Adding sm to the list of allowed BTLs should fix this issue.

@rhc54
Contributor

rhc54 commented Jun 7, 2023

Hmmm...well, I don't have any MCA params set at all, so I don't think that can be the case. Just for grins, I tried mpirun -n 4 --mca btl tcp,sm,self ./spawn, but that failed with the same errors.

So I went and tried a simple single-spawn test code and it worked fine:

$ mpirun -n 4 ./simple_spawn
[prterun-Ralphs-iMac-2-13219@1:0 pid 13220] starting up on node Ralphs-iMac-2.local!
[prterun-Ralphs-iMac-2-13219@1:1 pid 13221] starting up on node Ralphs-iMac-2.local!
[prterun-Ralphs-iMac-2-13219@1:2 pid 13222] starting up on node Ralphs-iMac-2.local!
[prterun-Ralphs-iMac-2-13219@1:3 pid 13223] starting up on node Ralphs-iMac-2.local!
0 completed MPI_Init
Parent [pid 13220] about to spawn!
2 completed MPI_Init
Parent [pid 13222] about to spawn!
1 completed MPI_Init
Parent [pid 13221] about to spawn!
3 completed MPI_Init
Parent [pid 13223] about to spawn!
[prterun-Ralphs-iMac-2-13219@2:0 pid 13224] starting up on node Ralphs-iMac-2.local!
[prterun-Ralphs-iMac-2-13219@2:1 pid 13225] starting up on node Ralphs-iMac-2.local!
[prterun-Ralphs-iMac-2-13219@2:2 pid 13226] starting up on node Ralphs-iMac-2.local!
Parent done with spawn
Parent done with spawn
Parent done with spawn
Parent done with spawn
Parent sending message to child
2 completed MPI_Init
Hello from the child 2 of 3 on host Ralphs-iMac-2.local pid 13226
1 completed MPI_Init
Hello from the child 1 of 3 on host Ralphs-iMac-2.local pid 13225
0 completed MPI_Init
Hello from the child 0 of 3 on host Ralphs-iMac-2.local pid 13224
Child 0 received msg: 38
Parent disconnected
Child 2 disconnected
Parent disconnected
Parent disconnected
Parent disconnected
Child 1 disconnected
Child 0 disconnected
13223: exiting
13221: exiting
13226: exiting
13222: exiting
13225: exiting
13220: exiting
13224: exiting
$

So it doesn't appear to be BTL related, but rather still something about the multiple spawn case. Anyway, just letting you know in case it is helpful.

@bosilca
Member Author

bosilca commented Jun 7, 2023

Here is the important snippet from your output:

3, after last barrier
1, after last barrier
2, after last barrier
0, after last barrier
5, after last barrier
4, after last barrier

This shows that all the processes (4+1+1) completed the last MPI_Barrier and went into MPI_Finalize. The fact that all processes were able to successfully complete a barrier tells us that the communication capabilities of the entire job are functional. Thus, you can safely ignore the errors; they are not related to this PR.
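
For reference, a test with the shape suggested by this output (4 initial processes, then two single-process spawns merged one at a time) could look roughly like the minimal sketch below. This is an assumption about the reproducer from #11724, not the reporter's actual code, and error handling is omitted.

/* Editorial sketch of a double-spawn/merge test; not the actual reproducer. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, inter, merged;
    int rank, remote = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* Initial job: 4 processes started by mpirun. */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        printf("%d, after first barrier\n", rank);

        /* First spawn: one process from a new jobid joins the group. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
        MPI_Intercomm_merge(inter, 0, &merged);
    } else {
        /* Spawned child: the remote size of the parent intercomm tells us
         * whether we are the first child (4 parents) or the second (5). */
        MPI_Comm_remote_size(parent, &remote);
        MPI_Intercomm_merge(parent, 1, &merged);
        MPI_Comm_rank(merged, &rank);
        printf("%d: spawned\n", rank);
    }

    if (5 != remote) {
        /* Everyone except the second child: barrier on the 5-process merged
         * comm, then spawn one more process from yet another jobid. */
        MPI_Barrier(merged);
        MPI_Comm_rank(merged, &rank);
        printf("%d, after second barrier\n", rank);

        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                       merged, &inter, MPI_ERRCODES_IGNORE);
        MPI_Intercomm_merge(inter, 0, &merged);  /* old handle leaked for brevity */
    }

    /* All 6 processes, spanning three jobids, meet here. */
    MPI_Barrier(merged);
    MPI_Comm_rank(merged, &rank);
    printf("%d, after last barrier\n", rank);

    MPI_Finalize();
    return 0;
}

After the second merge the communicator spans three jobids, which is exactly the case where the pre-patch code only pulled in the modex of the first jobid.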

@Robyroc

Robyroc commented Jun 8, 2023

I cherry-picked the commit into the v5.0.x branch and can confirm that it also fixes the issues on my platform. I could not make it work on the main branch due to some problems compiling PMI, but those should be outside the scope of this PR.

The original code was merging the local modex with the modex of the
local processes on the first jobid. This led to incorrect and
mismatched information among processes when joining processes from
multiple jobids (such as when the second spawn is merged).

This patch iterates over all the jobids on the list of "to connect"
processes and adds their information to the local modex.

Fixes open-mpi#11724.

Signed-off-by: George Bosilca <[email protected]>
@bosilca bosilca force-pushed the topic/fix_multi_spawn branch from 2bb9f0e to ba0bce4 on August 15, 2023 16:35
@janjust
Contributor

janjust commented Aug 17, 2023

@bosilca @abouteiller As soon as we get a review and a v5.0 cherry-pick, we'll be happy to get this in.

@bosilca
Member Author

bosilca commented Aug 17, 2023

We are having some internal discussions on the most efficient way to gather the modex across multiple jobs. I'll get back to this later tomorrow.

@bosilca
Member Author

bosilca commented Aug 17, 2023

We decided to move forward with this as is; if we notice a performance impact at scale, we can reassess.

@bosilca bosilca merged commit 8e656d9 into open-mpi:main Aug 17, 2023
@bosilca bosilca deleted the topic/fix_multi_spawn branch August 17, 2023 15:13
@janjust
Contributor

janjust commented Aug 17, 2023

OK, great - please open up the v5.0 cherry-pick.

Development

Successfully merging this pull request may close these issues.

MPI_Intercomm_merge not working after second ULFM shrink.