Revert "Converted to the new STK simple_fields workflow" #1253
This reverts commit aa0617e.
I do not want to revert this again. We need to get this in to move forward with trilinos. Let's debug it together.
I think fixing this without a revert is a pill we have to swallow. @djglaze let's loop you into the conversation.
Dave is out on vacation until the 10th, and then after that he has jury duty so his availability is going to be unknown...
Of course... Murphy and his ubiquitous law....
Arrrgh! Yeah, I'm on the road now between destinations. Sorry about that. I ran all the GPU tests successfully, so I guess our coverage might not be great. I think I'd attack this with a CPU build using the STK_USE_DEVICE_MESH define. Then, I'd see if the GPU tests reveal anything. We can also then attack some tests with valgrind to see if anything comes up. Dave
It segfaults on the CPU.
@jrood-nrel @psakievich I'm back in town and available to help out with debugging this issue. I don't have access to Frontier anymore, which hinders things a bit. Is there a stack trace from the seg-fault? The last one was figure-outable from just the stack trace, so hopefully that will be enough here. I'm assuming this failure was on a huge problem. Is there any way you can post the input deck somewhere, so that I can see what models were running? The mesh would be great, too, although I'm guessing it's too big to easily share.
@djglaze, last I checked with @jrood-nrel, the failure had only been observed when running the exawind driver. I was not able to reproduce it locally. It had to do with field accessors, so I don't think it is related to the size of the problem.
Thanks, @psakievich. I'm really out of the loop on this project, apparently. I don't know what "exawind driver" means. Is this a particular simulation configuration? That makes sense that it might not be directly due to the size of the problem. I'm betting it's something tricky with a Field that's registered with different sizes on different parts of the domain, maybe due to mixed element topologies. It's probably a combination of things that we don't have representation for in our unit/regression tests. I've got enough experience with these simple_fields changes over the last couple years that there's a slim chance I can think it through if I can get my hands on a stack trace.
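For readers following along, the hypothesis above (a Field registered with different sizes on different parts) can be illustrated with a small toy model. This is only a sketch and does not use the STK API: `ToyField`, `registerOnPart`, `data`, and `sumEntity` are hypothetical names invented for illustration, and the part names are made up. It shows why an accessor must query the per-part length instead of assuming one fixed size.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in (NOT STK): a field whose per-entity length
// depends on which part the entity belongs to.
struct ToyField {
  std::unordered_map<std::string, std::size_t> scalarsPerEntity;  // part -> length
  std::unordered_map<std::string, std::vector<double>> storage;   // part -> packed data

  // Register the field on a part with a given per-entity length.
  void registerOnPart(const std::string& part, std::size_t length,
                      std::size_t numEntities) {
    scalarsPerEntity[part] = length;
    storage[part].assign(length * numEntities, 0.0);
  }

  // Return a pointer to one entity's data and report its valid length.
  // Throws std::out_of_range if the field was never registered on the part.
  double* data(const std::string& part, std::size_t entity, std::size_t& length) {
    length = scalarsPerEntity.at(part);
    return storage.at(part).data() + entity * length;
  }
};

// Safe access pattern: loop over the length reported for this part.
// Hard-coding the length of another part (say 8 where only 4 values were
// registered) would read past the end of the entity's data.
double sumEntity(ToyField& field, const std::string& part, std::size_t entity) {
  std::size_t n = 0;
  const double* p = field.data(part, entity, n);
  double sum = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    sum += p[i];
  }
  return sum;
}
```

In this toy, registering the field with length 8 on one part and length 4 on another and then looping to 8 everywhere would overrun the shorter part's storage, which is the kind of mismatch a field memory-checking tool or valgrind is meant to catch.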
@djglaze the exawind-driver is the top level code that is used to couple nalu-wind and amr-wind. That code links against the In the stacktrace above you can see the transition based on when the leading namespace switches from
I've been studying the stack trace that @jrood-nrel provided, and the code looks solid to me. I'm starting to wonder if the issue is elsewhere, and it's just manifesting at this location. I've been unable (with only a modest amount of effort) to get current versions of I hot-wired my local Trilinos version to be new-enough that I could attack it with the STK manual Field memory poisoning tool (that @psakievich is likely familiar with from the Sierra I've identified four STK commits from after 2023-02-28 that fix various potential memory corruption issues. These pre-existing issues were discovered while running all of the Sierra tests, using both the simple_fields changes (that you have) and the new variable-capacity Buckets changes (that you don't have). I think I'd strongly recommend moving your Trilinos version to something much newer, to get the benefit of these fixes as well as others that I undoubtedly missed. Something after 2023-07-12, Trilinos SHA If you need to stay pinned to 2023-02-28, then I'd recommend applying 4 new Trilinos patches that correspond to these Sierra commits (that @psakievich has access to):
Moving my Trilinos version forward far enough to scoop up the memory debugging tool also included most of these fixes, so I never directly observed them being a problem in my local runs of all tests. Still, I think they are known issues that are good candidates for fixing John's seg-fault. Beyond this, I'm quickly running out of ideas.
I'm still trying to test some large cases on Frontier, but the latest code runs for the most part for me on smaller problems and other machines, so regardless I agree with not actually reverting this.
Reverts #1237
This appears to segfault in field operations when running on Frontier. I think we should revert it.