Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clear sync state of fields for restart #1241

Merged
merged 3 commits into from
Mar 19, 2024
Merged

Conversation

psakievich
Copy link
Contributor

Sync issues are arising for fields that have been messed with prior to the populate_restart call. For restart, we are just reading from file so sync state should be clear when reading.

Closes #1191

Sync issues are arising for fields that have been messed with prior to the `populate_restart` call.  For restart, we are just reading from file so sync state should be clear when reading.
Copy link
Contributor

@alanw0 alanw0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is probably reasonable if we're just replacing all field data by reading from file.

@lawrenceccheung
Copy link
Contributor

Thanks @psakievich. Note that the current fix generates this error message:

  >> 593    /projects/wind_uq/lcheung/NaluBuilds/exawind-manager.attaway.20240220/exawind-manager/environments/naluwindNateVerOF34patch/nalu-wind/src/Realm.C:3416:9: error: 'fld' was not declared in this scope

so I put it inside the loop:

        for (unsigned i = 0; i < numStates; ++i) {
          auto* fld = field->field_state(static_cast<stk::mesh::FieldState>(i));
          fld->clear_sync_state();
          fld->modify_on_host();
          ngp_field_manager().get_field<double>(fld->mesh_meta_data_ordinal());
          fld->sync_to_device();
        }

We're testing this out now, let's see if fixes the problems.

Lawrence

@alanw0
Copy link
Contributor

alanw0 commented Mar 7, 2024

It doesn't make sense to do both clear_sync_state and also modify/sync. Where is the field being modified? You should only do modify if you are modifying the field data. Alternatively you do clear_sync_state if you intend to overwrite the field data and you don't care which memory space has the most recent previous modification.

@psakievich
Copy link
Contributor Author

I'm second guessing this now too @alanw0. The modification is on L3419, but the states are all screwed up.

@alanw0
Copy link
Contributor

alanw0 commented Mar 11, 2024

I'm second guessing this now too @alanw0. The modification is on L3419, but the states are all screwed up.

What do you mean by 'screwed up'?
Do you think that the I/O read is failing to mark some fields or field-states as modify-on-host? We have modify_on_host calls in the I/O, but let us know if you think something is missing.
Perhaps the I/O is only reading state new, and marking that field as modified-on-host, but then state old is left un-marked? Or vice-versa? That should be ok, since it would get fixed the next time you do a state rotation, or the next time you need the field (of either state) on device, which is where sync_to_device would get called.

Keep us posted if you think stk is missing some calls or has things wrong.

@psakievich
Copy link
Contributor Author

psakievich commented Mar 11, 2024

@alanw0 #1190 shouldn't be happening is what I mean. I'm also not really sure why we need this per component code here either though. Seems like it is likely redundant. I also don't think we need a sync to device here either. That should happen at the calling points for the code using the fields on device. Seems like this could be excessive host-device communication.

Seems like we should be able to just delete all of this:

nalu-wind/src/Realm.C

Lines 3421 to 3435 in 56c0e6d

{
for (const auto& fname : outputInfo_->restartFieldNameSet_) {
auto* field = stk::mesh::get_field_by_name(fname, meta_data());
if (field == nullptr)
continue;
const unsigned numStates = field->number_of_states();
for (unsigned i = 0; i < numStates; ++i) {
auto* fld = field->field_state(static_cast<stk::mesh::FieldState>(i));
fld->clear_sync_state();
fld->modify_on_host();
ngp_field_manager().get_field<double>(fld->mesh_meta_data_ordinal());
fld->sync_to_device();
}
}
}

Provided the ioBroker marks fields as modified on host when it reads them in.

@psakievich psakievich enabled auto-merge (squash) March 19, 2024 16:17
@psakievich psakievich merged commit 7645bdc into master Mar 19, 2024
3 checks passed
@psakievich psakievich deleted the bug/sync-state-restart branch March 19, 2024 16:44
@lawrenceccheung
Copy link
Contributor

My apologies, I should have reported this earlier -- adding the fld->clear_sync_state(); inside the loop did fix the modify on host/device problems that Ken and Myra were seeing in their runs.

Lawrence

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

More 'uncleared modified_on_device' for temperature field
3 participants