Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "Converted to the new STK simple_fields workflow" #1253

Closed
wants to merge 1 commit into from

Conversation

jrood-nrel
Copy link
Contributor

Reverts #1237

This appears to segfault in field operations when running on Frontier. I think we should revert it.

@jrood-nrel jrood-nrel requested a review from psakievich April 4, 2024 19:17
@psakievich
Copy link
Contributor

I do not want to revert this again. We need to get this in to move forward with trilinos. Let's debug it together.

Copy link
Contributor

@psakievich psakievich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think fixing this without a revert is a pill we have to swallow. @djglaze let's loop you into the conversation.

@alanw0
Copy link
Contributor

alanw0 commented Apr 4, 2024

I think fixing this without a revert is a pill we have to swallow. @djglaze let's loop you into the conversation.

Dave is out on vacation until the 10th, and then after that he has jury duty so his availability is going to be unknown...

@psakievich
Copy link
Contributor

Dave is out on vacation until the 10th, and then after that he has jury duty so his availability is going to be unknown...

Of course... Murphy and his ubiquitous law....
Well I am happy to try to track it down.

@djglaze
Copy link
Contributor

djglaze commented Apr 5, 2024

Arrrgh! Yeah, I'm on the road now between destinations. Sorry about that. I ran all the GPU tests successfully, so I guess our coverage might not be great.

I think I'd attack this with a CPU build using the STK_USE_DEVICE_MESH define. Then, I'd see if the GPU tests reveal anything. We can also then attack some tests with valgrind to see if anything comes up.

Dave

@jrood-nrel
Copy link
Contributor Author

It segfaults on the CPU.

@djglaze
Copy link
Contributor

djglaze commented Apr 10, 2024

@jrood-nrel @psakievich I'm back in town and available to help out with debugging this issue. I don't have access to Frontier anymore, which hinders thing a bit. Is there a stack trace from the seg-fault? The last one was figure-outable from just the stack trace, so hopefully that will be enough here.

I'm assuming this failure was on a huge problem. Is there any way you can post the input deck somewhere, so that I can see what models were running? The mesh would be great, too, although I'm guessing it's too big to easily share.

@psakievich
Copy link
Contributor

@djglaze last I checked with @jrood-nrel the failure was only observed so far when running the exawind driver. I was not able to reproduce locally. It had to do with field accessors so I don't think it is related to the size of the problem

@djglaze
Copy link
Contributor

djglaze commented Apr 10, 2024

@djglaze last I checked with @jrood-nrel the failure was only observed so far when running the exawind driver. I was not able to reproduce locally. It had to do with field accessors so I don't think it is related to the size of the problem

Thanks, @psakievich. I'm really out of the loop on this project, apparently. I don't know what "exawind driver" means. Is this a particular simulation configuration?

That makes sense that it might not be directly due to the size of the problem. I'm betting it's something tricky with a Field that's registered with different sizes on different parts of the domain, maybe due to mixed element topologies. It's probably a combination of things that we don't have representation for in our unit/regression tests. I've got enough experience with these simple_fields changes over the last couple years that there's a slim chance I can think it through if I can get my hands on a stack trace.

@jrood-nrel
Copy link
Contributor Author

#0  0x00007fffe7abc60f in void stk::mesh::field_fill<double>(double, stk::mesh::FieldBase const&) ()
   from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#1  0x00007fffe8210951 in sierra::nalu::NodalGradAlgDriver<stk::mesh::Field<double, void, void, void, void, void, void, void> >::pre_work() ()
   from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#2  0x00007fffe820890e in sierra::nalu::NgpAlgDriver::execute() ()
   from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#3  0x00007fffe7bd3527 in sierra::nalu::SpecificDissipationRateEquationSystem::assemble_nodal_gradient() ()
   from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#4  0x00007fffe7b99aa8 in sierra::nalu::ShearStressTransportEquationSystem::solve_and_update() ()
   from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#5  0x00007fffe7a623f4 in sierra::nalu::EquationSystems::solve_and_update() ()
   from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#6  0x00007fffe7b75f2f in sierra::nalu::Realm::nonlinear_iterations(int) ()
   from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#7  0x00007fffe7b7491b in sierra::nalu::Realm::advance_time_step() ()
   from /lustre/orion/cfd162/proj-shared/jrood/exawind-manager/opt/linux-sles15-zen3/2024-03-29/linux-sles15-zen3/clang-15.0.0/nalu-wind-master-r4x5mc4j5ecgltvibvpdt76leudodj3b/lib/libnalu.so
#8  0x0000000000219f6f in exawind::NaluWind::advance_timestep(unsigned long) ()
#9  0x0000000000224ac4 in exawind::ExawindSolver::call_advance_timestep(unsigned long, bool) ()
#10 0x0000000000222a14 in exawind::OversetSimulation::run_timesteps(int, int, int) ()
#11 0x000000000022c40e in main ()

@psakievich
Copy link
Contributor

psakievich commented Apr 10, 2024

Thanks, @psakievich. I'm really out of the loop on this project, apparently. I don't know what "exawind driver" means. Is this a particular simulation configuration?

@djglaze the exawind-driver is the top level code that is used to couple nalu-wind and amr-wind. That code links against the libnalu.so and reconstructs the guts of a Nalu::Simulation objects operations so they can be interwoven with amr-wind's solve process.

In the stacktrace above you can see the transition based on when the leading namespace switches from exawind to sierra.

@djglaze
Copy link
Contributor

djglaze commented Apr 13, 2024

I've been studying the stack trace that @jrood-nrel provided, and the code looks solid to me. I'm starting to wonder if the issue is elsewhere, and it's just manifesting at this location.

I've been unable (with only a modest amount of effort) to get current versions of spack-manager or exawind-manager working for my builds. The documentation is not current, and I was unable to view the Slack channel with more information because my accounts have lapsed. So, I'm running a version of spack-manager from a couple months ago, before the big modularization refactor and I can't see how things are currently being built. Are you guys still running a version of Trilinos pinned to 2023-02-28?

I hot-wired my local Trilinos version to be new-enough that I could attack it with the STK manual Field memory poisoning tool (that @psakievich is likely familiar with from the Sierra stk-field-asan dashboard line), and I found a few minor issues among all of the regression and unit tests. PR #1256 fixes what I found, although they are unlikely to be the cause of this seg-fault.

I've identified four STK commits from after 2023-02-28 that fix various potential memory corruption issues. These pre-existing issues were discovered while running all of the Sierra tests, using both the simple_fields changes (that you have) and the new variable-capacity Buckets changes (that you don't have). I think I'd strongly recommend moving your Trilinos version to something much newer, to get the benefit of these fixes as well as others that I undoubtedly missed. Something after 2023-07-12, Trilinos SHA 8be8abe06fec is preferred.

If you need to stay pinned to 2023-02-28, then I'd recommend applying 4 new Trilinos patches that correspond to these Sierra commits (that @psakievich has access to):

  • 296698c2b92e: Add a safety check to stk::mesh::field_copy()
    • This doesn't fix mismatched-size Field copy operations, but it adds a throw if one is detected. This might trip on a now-different-sized Field error. This is used in 29 places in nalu-wind.
  • d8a5bddc652c: Fixed zero initialization optimization in FieldBLAS::fill() function
    • This error writes a few bytes past the end of a Field when calling field_fill_component() or field_fill() with a non-unity stride and an initial offset past the beginning of the Field. nalu-wind uses problematic calls in at least 6 places. There are hundreds of additional field_fill() calls that should be safe, although it's suspicious that this is where the seg-fault occurred.
  • 71f879881ced: Fixed memory error when reading int Fields from disk
    • I don't know if you guys read int Fields from an Exodus file, but if you do, then this can write beyond the bounds of the Field.
  • 3fe12f190568: Fixed memory error when skinning elements with multiple side topologies
    • This error can read beyond the bounds of a data array. It can lead to bizarre behavior if you do skinning with multiple side topologies.

Moving my Trilinos version forward far enough to scoop up the memory debugging tool also included most of these fixes, so I never directly observed them being a problem in my local runs of all tests. Still, I think they are known issues that are good candidates for fixing John's seg-fault. Beyond this, I'm quickly running out of ideas.

@psakievich
Copy link
Contributor

@djglaze see #1255. I can help you get set up to build this week. I don't doubt that there was a struggle with the cee resources at the moment so sorry for the struggle on your end. The old version of spack-manager should be okay though.

@jrood-nrel
Copy link
Contributor Author

I'm still trying to test some large cases on Frontier, but the latest code runs for the most part for me on smaller problems and other machines, so regardless I agree to not actually reverting this.

@jrood-nrel jrood-nrel closed this Apr 17, 2024
@jrood-nrel jrood-nrel deleted the revert-1237-master branch April 17, 2024 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants