Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GPU issue with SCS_SHIFTED_GRAD_OP and SCS_GRAD_OP #1302

Closed
marchdf opened this issue Sep 23, 2024 · 12 comments
Closed

GPU issue with SCS_SHIFTED_GRAD_OP and SCS_GRAD_OP #1302

marchdf opened this issue Sep 23, 2024 · 12 comments

Comments

@marchdf
Copy link
Contributor

marchdf commented Sep 23, 2024

I am having an issue with the following unit tests ./unittestX --gtest_filter="WallDistKernelHex8Mesh.NGP_wall_dist*"

They are throwing:

[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from WallDistKernelHex8Mesh
[ RUN      ] WallDistKernelHex8Mesh.NGP_wall_dist
/mnt/vdb/home/mhenryde/exawind/exawind-manager/environments/nalu-wind-gpu-test/nalu-wind/unit_tests/kernels/UnitTestKernelUtils.h:219: Failure
The difference between hostCalcValue(i, j) and exactValue[i][j] is 0.421875, which exceeds tol, where
hostCalcValue(i, j) evaluates to 0,
exactValue[i][j] evaluates to 0.421875, and
tol evaluates to 1.0000000000000001e-15.
/mnt/vdb/home/mhenryde/exawind/exawind-manager/environments/nalu-wind-gpu-test/nalu-wind/unit_tests/kernels/UnitTestKernelUtils.h:219: Failure
The difference between hostCalcValue(i, j) and exactValue[i][j] is 0.046875, which exceeds tol, where
hostCalcValue(i, j) evaluates to 0,
exactValue[i][j] evaluates to -0.046875, and
tol evaluates to 1.0000000000000001e-15.
/mnt/vdb/home/mhenryde/exawind/exawind-manager/environments/nalu-wind-gpu-test/nalu-wind/unit_tests/kernels/UnitTestKernelUtils.h:219: Failure

In digging further it appears that v_dndx(ip, ic, j) here is zero (v_scs_areav(ip, j) is non-zero). Why is that?

v_dndx comes from const auto& v_dndx = shiftPoisson_ ? meViews.dndx_shifted : meViews.dndx;, which should be defined by

  auto gradOp = shiftPoisson_ ? SCS_SHIFTED_GRAD_OP : SCS_GRAD_OP;
  dataPreReqs.add_master_element_call(gradOp, CURRENT_COORDINATES);

But why would those be zero? This test seems fine on CPU.

I did fix a (minor) thing in the way the expect_all_near was being called. But that's not the issue. See PR #1301.

Tagging @psakievich, @rcknaus and @alanw0 as people who might know what's going on.

@marchdf
Copy link
Contributor Author

marchdf commented Sep 23, 2024

Of course this looks fine with a Release build... The above was a Debug build.

@alanw0
Copy link
Contributor

alanw0 commented Sep 23, 2024

Marc, my initial thought was that there is a missing 'field.sync_to_host()' somewhere, which would explain why the 'hostCalcValue' might be zero when it is expected to be non-zero.
But I certainly don't know why it would be different between debug and release...
At the point where it is checking those values, you could add EXPECT_FALSE(field.need_sync_to_host());
(whatever the actual 'field' variable is).
Recall the usage pattern for these fields should always be to sync it to the current/desired memory space before accessing, and if writing to it, call field.modify_on_device()/modify_on_host().
The 'smart field' idea that Phil was deploying would automate that pattern, but I don't think he got that deployed everywhere yet.

@marchdf
Copy link
Contributor Author

marchdf commented Sep 23, 2024

Hi Alan, thanks for getting back to me so quickly. I will add the check you suggest.

@marchdf
Copy link
Contributor Author

marchdf commented Sep 24, 2024

@alanw0 I tried implementing the check but the places where this breaks don't have a need_sync_to_host() member function.

The problem really is that v_dndx = 0 in this expression lhsfac += v_dndx(ip, ic, j) * v_scs_areav(ip, j); (here). If I just hard code that to some value (or just leave v_scs_areav) then the hostCalcValue is non zero, indicating that everything is being copied back to host ok. v_dndx is non-zero for a CPU build or a Release build.

I am not familiar with these master element views and I am worried that somehow dndx asked by SCS_SHIFTED_GRAD_OP or SCS_GRAD_OP when doing dataPreReqs.add_master_element_call(gradOp, CURRENT_COORDINATES); is somehow not getting to GPU.

@marchdf
Copy link
Contributor Author

marchdf commented Sep 24, 2024

One thing that worries me is that any of these tests for these master element views are ifdefed out on GPU: https://github.com/Exawind/nalu-wind/blob/master/unit_tests/UnitTestKokkosME.C#L296. It looks like that was done by you @alanw0 in 111af20. So maybe they are buggy? But I can't see why scs_areav would work but not dndx.

@alanw0
Copy link
Contributor

alanw0 commented Sep 24, 2024

One thing that worries me is that any of these tests for these master element views are ifdefed out on GPU:

Memories are fuzzy, but as I recall we originally had a bunch of code that wouldn't work on GPU, and we gradually started converting it. At some point we did this ifdef'ing to guard the not-yet-converted code, and then we down-selected to only convert the code/algorithms that were specifically needed for the challenge problems, as time ran short.

@alanw0
Copy link
Contributor

alanw0 commented Sep 24, 2024

I am not familiar with these master element views and I am worried that somehow dndx asked by SCS_SHIFTED_GRAD_OP or SCS_GRAD_OP when doing dataPreReqs.add_master_element_call(gradOp, CURRENT_COORDINATES); is somehow not getting to GPU.

I'm going to tag James Overfelt, who was the master-element expert. @overfelt

@overfelt
Copy link
Contributor

I think these unit tests at one time compared the results of the old CPU only (Fortran based) master element functions to the newer CPU/GPU callable implementations. With the last refactor to get rid of all of the Fortran code these tests could be deleted.

@marchdf
Copy link
Contributor Author

marchdf commented Sep 24, 2024

@overfelt thank you. Do you see anything in the implementation of SCS_SHIFTED_GRAD_OP and SCS_GRAD_OP that would explain the behavior I am seeing here: #1302 (comment)

@overfelt
Copy link
Contributor

The code looks fine to me. It is odd that most of the computed arrays are correct and only one is all zeros. Do you happen to know the value of shiftPoisson_ in this case?

@marchdf
Copy link
Contributor Author

marchdf commented Sep 24, 2024

know the value of shiftPoisson_ in this case

yes, it is true for one test, false for the other. Both fail.

@marchdf
Copy link
Contributor Author

marchdf commented Oct 24, 2024

Fixed in #1317

@marchdf marchdf closed this as completed Oct 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants