prov/shm: fix name compare bug #10655

Merged
merged 1 commit into from
Dec 21, 2024

Conversation

aingerson
Contributor

Could result in a peer getting incorrectly unmapped

Signed-off-by: Alexia Ingerson <[email protected]>
@aingerson aingerson requested a review from shijin-aws December 20, 2024 18:26
@aingerson
Contributor Author

@shijin-aws I'm going to cherry-pick this into 2.0.x. I want to put it on your radar because 2.0.0 has this bug and it could have implications for you.

@shijin-aws
Contributor

@aingerson thanks! But I haven't seen any trouble on our side yet. What is the impact of this bug?

@aingerson
Contributor Author

@shijin-aws We saw segfaults on close in Intel MPI, similar to this one you reported on Open MPI:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f4547b07420 in smr_unmap_region (prov=0x7f4547bfb260 <smr_prov>, map=0x55c1c703b378, peer_id=0) at prov/shm/src/smr_util.c:506
506             if (peer_region->pid == getpid())
[Current thread is 1 (Thread 0x7f454d773fc0 (LWP 1545539))]
(gdb) bt
#0  0x00007f4547b07420 in smr_unmap_region (prov=0x7f4547bfb260 <smr_prov>, map=0x55c1c703b378, peer_id=0) at prov/shm/src/smr_util.c:506
#1  0x00007f4547b079e1 in smr_map_unmap (rbmap=0x55c1c703b398, node=0x55c1c715b490, context=0x0) at prov/shm/src/smr_util.c:601
#2  0x00007f4547b07acf in smr_map_del (map=0x55c1c703b378, shm_id=0) at prov/shm/src/smr_util.c:619
#3  0x00007f4547b04800 in smr_av_remove (av_fid=0x55c1c703b270, fi_addr=0x7f4537e33258, count=1, flags=0) at prov/shm/src/smr_av.c:220
#4  0x00007f4547a69a6b in fi_av_remove (av=0x55c1c703b270, fi_addr=0x7f4537e33258, count=1, flags=0) at ./include/rdma/fi_domain.h:531
#5  0x00007f4547a6d38b in efa_conn_rdm_deinit (av=0x55c1c70371e0, conn=0x7f4537e33220) at prov/efa/src/efa_av.c:358
#6  0x00007f4547a70a69 in efa_conn_release (av=0x55c1c70371e0, conn=0x7f4537e33220) at prov/efa/src/efa_av.c:555
#7  0x00007f4547a714b4 in efa_av_close_reverse_av (av=0x55c1c70371e0) at prov/efa/src/efa_av.c:794
#8  0x00007f4547a71591 in efa_av_close (fid=0x55c1c7037228) at prov/efa/src/efa_av.c:811
#9  0x00007f454c0c1000 in mca_btl_ofi_finalize () from /opt/amazon/openmpi/lib/openmpi/mca_btl_ofi.so
#10 0x00007f454d8fcddd in ?? () from /opt/amazon/openmpi/lib/libopen-pal.so.40
#11 0x00007f454d8e488c in mca_base_framework_close () from /opt/amazon/openmpi/lib/libopen-pal.so.40
#12 0x00007f454d8e488c in mca_base_framework_close () from /opt/amazon/openmpi/lib/libopen-pal.so.40
#13 0x00007f454dc46265 in ompi_mpi_finalize () from /opt/amazon/openmpi/lib/libmpi.so.40
#14 0x000055c1c603cd39 in main (argc=<optimized out>, argv=<optimized out>) at osu_get_acc_latency.c:115
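
For illustration only, here is a minimal sketch of how a name compare bug of this general kind can pick the wrong peer, which would then be unmapped in place of the intended one. This is not the libfabric code and not the actual fix in this PR. `peer_entry`, `find_peer_buggy`, `find_peer_fixed`, and `NAME_MAX_LEN` are hypothetical names, and the prefix-compare mistake shown is an assumption about what such a bug could look like.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAME_MAX_LEN 64 /* hypothetical limit, not the shm provider's */

/* Hypothetical peer table entry; not the shm provider's real layout. */
struct peer_entry {
    char name[NAME_MAX_LEN]; /* NUL-terminated endpoint name */
    int  id;                 /* index used to map/unmap the peer */
};

/* Buggy lookup (assumed failure mode): only strlen(name) bytes are
 * compared, so a stored name that merely starts with `name` matches. */
static int find_peer_buggy(const struct peer_entry *peers, size_t count,
                           const char *name)
{
    assert(peers != NULL && name != NULL);
    for (size_t i = 0; i < count; i++) {
        if (strncmp(peers[i].name, name, strlen(name)) == 0)
            return peers[i].id;
    }
    return -1;
}

/* Fixed lookup: compares up to the full buffer size, so the terminating
 * NUL takes part in the comparison and a prefix no longer matches. */
static int find_peer_fixed(const struct peer_entry *peers, size_t count,
                           const char *name)
{
    assert(peers != NULL && name != NULL);
    for (size_t i = 0; i < count; i++) {
        if (strncmp(peers[i].name, name, NAME_MAX_LEN) == 0)
            return peers[i].id;
    }
    return -1;
}
```

With a table holding `ep_10` (id 0) followed by `ep_1` (id 1), looking up `ep_1` returns 0 from the buggy version and 1 from the fixed one. Acting on the buggy result would remove the wrong peer.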

@shijin-aws
Contributor

Oh yeah, I called it out in the original PR, but somehow it stopped showing up, so I let it go.

@aingerson
Contributor Author

@shijin-aws Yeah, I thought we resolved it, so I let it go too. Not sure why this only shows up in Intel MPI.

@shijin-aws
Contributor

shijin-aws commented Dec 20, 2024

@aingerson Did you disable Intel MPI's shm in your test? We used to do that, but recently we only test with their default shm setting (using ofi:shm).

@aingerson
Contributor Author

@shijin-aws I'm not sure. The bug report came from IMPI, but they must have disabled their internal shm to get here. If you only use IMPI shm for IMPI, then this shouldn't be a big issue for you! But if you see any segfaults on finalize anywhere else, then it's probably this.

@aingerson aingerson merged commit 442fa89 into ofiwg:main Dec 21, 2024
13 checks passed