Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[4.1.x] pml/cm,mtl/ofi: fix datatype offsetting #11907

Merged
merged 2 commits into from
Sep 8, 2023

Conversation

wenduwan
Copy link
Contributor

@wenduwan wenduwan commented Sep 6, 2023

This reverts commit 6250f54 and manually picks 17b09d9

It fixes a segfault revealed by mtt ibm bsend test.

bot:notacherrypick

…thin one"

This reverts commit 6250f54.

The revert fixes a segfault revealed by mtt ibm bsend test.

bot:notacherrypick

Signed-off-by: Wenduo Wang <[email protected]>
@github-actions github-actions bot added this to the v4.1.6 milestone Sep 6, 2023
@wenduwan wenduwan added the bug label Sep 6, 2023
@wzamazon
Copy link
Contributor

wzamazon commented Sep 6, 2023

I do not think revert this patch is the right fix.

The original patch 6250f54 would fix a data corruption issue #11751, which would impact real world application like MPAS (a weather forecasting application)

@wzamazon
Copy link
Contributor

wzamazon commented Sep 6, 2023

If the patch caused segfault, you would need to fix the root cause.

@wenduwan
Copy link
Contributor Author

wenduwan commented Sep 6, 2023

@wzamazon Thanks for the reminder! Let me dig more into the segfault.

For the MPAS issue, do you happen to have the reproducer?

@hppritcha
Copy link
Member

so should this PR be closed then?

@wenduwan
Copy link
Contributor Author

wenduwan commented Sep 6, 2023

@hppritcha I'm looking into a fix today. I plan to reuse this PR. We need to fix bsend for the next 4.1 release IMO.

@wzamazon
Copy link
Contributor

wzamazon commented Sep 6, 2023

The reproducer is attached
alltoallw_custom_datatype.tar.gz

Can you add it to one of the test suite in ompi-tests?

@hppritcha
Copy link
Member

@wenduwan are you going to add the test case or do you want me to do this?

@wenduwan
Copy link
Contributor Author

wenduwan commented Sep 6, 2023

@hppritcha I'm still looking into the bug. The reproducer uses MPI_Alltoallw but I have a hunch that I can reduce it to send/recv.

Anyways I will add a test - which test suite do you recommend, ibm?

@hppritcha
Copy link
Member

please add to ibm.

This change picks 17b09d9

pml/cm and mtl/ofi on 4.x have diverged further from main. This change
manually applies the logic to handle datatype offsetting.

It addresses open-mpi#11751

Signed-off-by: Wenduo Wang <[email protected]>
@wenduwan wenduwan changed the title Revert "pml/cm: make heavy send request initialization to be same as … pml/cm,mtl/ofi: fix datatype offsetting Sep 7, 2023
@wenduwan wenduwan changed the title pml/cm,mtl/ofi: fix datatype offsetting [4.1.x] pml/cm,mtl/ofi: fix datatype offsetting Sep 7, 2023
@wenduwan
Copy link
Contributor Author

wenduwan commented Sep 7, 2023

@wzamazon I figured that William has fixed this in 17b09d9 on main & 5.0.x so I added a commit to manually pick the relevant pieces.

With this PR both the reproducer and mtt bsend tests pass.

@wzamazon
Copy link
Contributor

wzamazon commented Sep 7, 2023

I figured that William has fixed this in 17b09d9 on main & 5.0.x so I added a commit to manually pick the relevant pieces.

Thank you! Approved!

@wenduwan
Copy link
Contributor Author

wenduwan commented Sep 7, 2023

@bwbarrett @jsquyres Can we include this change for 4.1.6 release?

@wenduwan
Copy link
Contributor Author

wenduwan commented Sep 7, 2023

Reproducer PR https://github.com/open-mpi/ompi-tests/pull/181

Wei's original example used Alltoallw, which happens to trigger the pml_cm_isend_init path. I find it simpler to use MPI_Send_init directly.

@jsquyres jsquyres merged commit 34796ca into open-mpi:v4.1.x Sep 8, 2023
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants