Turn C96C48_ufs_hybatmDA and C48mx500_3DVarAOWCDA into a regression test #3120

Open: wants to merge 27 commits into base: develop

Contributor

@DavidNew-NOAA DavidNew-NOAA commented Nov 21, 2024

Description

This PR is a companion to GDASApp PR #1365 (merged).

It turns C96C48_ufs_hybatmDA and C48mx500_3DVarAOWCDA into a regression test using the JEDI application testing feature. This feature is turned on using the new DO_TEST_MODE parameter added to config.base in GW PR #3115. This parameter is set to "YES" in the yaml defaults for the JEDI-based CI tests in GW.

The motivation for this PR is a need to catch changes in JEDI which alter the outputs of our applications.

Type of change

  • Bug fix (fixes something broken)
  • New feature (adds functionality)
  • Maintenance (code refactor, clean-up, new CI test, etc.)

Change characteristics

  • Is this a breaking change (a change in existing functionality)? YES
  • Does this change require a documentation update? NO
  • Does this change require an update to any of the following submodules? YES
    • EMC verif-global
    • GDAS PR #1365 (merged)
    • GFS-utils
    • GSI
    • GSI-monitor
    • GSI-utils
    • UFS-utils
    • UFS-weather-model
    • wxflow

How has this been tested?

C96C48_hybatmDA, C96C48_ufs_hybatmDA, C96C48_hybatmaerosnowDA, C48mx500_3DVarAOWCDA, and C48mx500_hybAOWCDA have been tested successfully on Hera

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have documented my code, including function, input, and output descriptions
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • This change is covered by an existing CI test or a new one has been added
  • Any new scripts have been added to the .github/CODEOWNERS file with owners
  • I have made corresponding changes to the system documentation if necessary

Contributor

@CoryMartin-NOAA CoryMartin-NOAA left a comment

Looks good to me once the change has been made to use the same testing option flag as the other PR

@DavidNew-NOAA
Contributor Author

@CoryMartin-NOAA Done!

@DavidNew-NOAA
Contributor Author

I misunderstood how default yamls work. I'm merging Cory's branch into mine and redoing the test references

DavidNew-NOAA added a commit to NOAA-EMC/GDASApp that referenced this pull request Nov 25, 2024
This PR is a companion to GW PR
[#3120](NOAA-EMC/global-workflow#3120).

It does a couple things:

1. 5 GW CI tests are added/extended as CTests in GDASApp, running
through to the fcst jobs in the first full-cycle. These CI tests are:
```C96C48_hybatmDA```, ```C96C48_ufs_hybatmDA```,
```C96C48_hybatmaerosnowDA```, ```C48mx500_3DVarAOWCDA```, and
```C48mx500_hybAOWCDA```.
2. Test references are added for ```C96C48_ufs_hybatmDA``` and
```C48mx500_3DVarAOWCDA```, so that we're actually testing the output.
3. These CTests are turned on by default in a workflow build, rather
than having to mess with the ```CMakeCache.txt``` file and re-running
make. This will allow us to use these tests in nightly testing.
4. ```test/gw-ci/CMakeLists.txt``` is refactored quite a bit.
5. There are 89 CTests for the 5 CI tests, but I added task
dependencies so they can be run in parallel.

The primary motivation for this PR is that we can run CI for our nightly
testing of GDASApp. Also, anyone with a PR can easily do CI testing
through CTests.

---------

Co-authored-by: Russ-Treadon-NOAA <[email protected]>
@DavidNew-NOAA DavidNew-NOAA changed the title Turn C96C48_ufs_hybatmDA into a regression test Turn C96C48_ufs_hybatmDA and C48mx500_3DVarAOWCDA into a regression test Nov 25, 2024
@DavidNew-NOAA DavidNew-NOAA marked this pull request as ready for review November 25, 2024 23:38
Contributor

@RussTreadon-NOAA RussTreadon-NOAA left a comment

Installed DavidNew-NOAA:feature/gw-ci at f74c93b on Hera. Run test_gdasapp ctests. All 133 tests pass. Run with 12 threads. All tests complete in 3479.51 seconds.

Contributor

@AndrewEichmann-NOAA AndrewEichmann-NOAA left a comment

No objections

@RussTreadon-NOAA
Contributor

g-w CI tests on Hera

The C96C48_ufs_hybatmDA 20240224 00Z gfs_atmanlvar job failed due to the reference check

 0: OOPS_STATS Run end                                  - Runtime:    751.37 sec,  Memory: total:   632.81 Gb, per task: min =     6.27 Gb, max =     7.60 Gb
 0: Run: Finishing oops::Variational<FV3JEDI, UFO and IODA observations> with status = 0
 0: terminate called after throwing an instance of 'oops::TestReferenceFloatMismatchError'
 0:   what():  Test reference Float mismatch @ Line:2
 0: Test Val : 5.6385719460046617e+05
 0: Ref  Val : 6.0590797038184456e+05
 0: Delta    : 4.2050775781378383e+04
 0: Relative tolerance: 5.8488258249115540e+02
 0: Absolute tolerance: 1.0000000000000001e-05
 0: Test Line: 'CostJo   : Nonlinear Jo(Aircraft) = 5.6385719460046617e+05, nobs = 476911, Jo/n = 1.1823111536543844e+00, err = 2.2308811279218053e+00'
 0: Ref Line : 'CostJo   : Nonlinear Jo(Aircraft) = 6.0590797038184456e+05, nobs = 504555, Jo/n = 1.2008759607611550e+00, err = 2.2437239435917040e+00'
srun: error: h35m09: task 0: Aborted (core dumped)

The gfs and gdas use different data dumps. The gdas analysis assimilates more data than the gfs.

We need two reference check files - one for gdas_atmanlvar and another for gfs_atmanlvar. If we don't want to test gfs_atmanlvar, we need to change the yaml used by the gfs_atmanlvar job.
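
To make the mismatch above concrete, the two `CostJo` lines from the log can be compared directly. This is a minimal Python sketch, not the OOPS reference check itself: the helper `parse_costjo` and its regular expression are hypothetical and assume only the line layout shown in the log.

```python
import re

# Lines copied from the gfs_atmanlvar log above.
TEST_LINE = ("CostJo   : Nonlinear Jo(Aircraft) = 5.6385719460046617e+05, "
             "nobs = 476911, Jo/n = 1.1823111536543844e+00, err = 2.2308811279218053e+00")
REF_LINE = ("CostJo   : Nonlinear Jo(Aircraft) = 6.0590797038184456e+05, "
            "nobs = 504555, Jo/n = 1.2008759607611550e+00, err = 2.2437239435917040e+00")

# Hypothetical pattern; assumes the "Jo(<obs type>) = <float>, nobs = <int>" layout.
PATTERN = re.compile(r"Jo\((?P<obs>\w+)\) = (?P<jo>[-+0-9.eE]+), nobs = (?P<nobs>\d+)")


def parse_costjo(line):
    """Return (obs type, Jo, nobs) parsed from a CostJo log line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"not a CostJo line: {line!r}")
    return match["obs"], float(match["jo"]), int(match["nobs"])


obs, test_jo, test_nobs = parse_costjo(TEST_LINE)
_, ref_jo, ref_nobs = parse_costjo(REF_LINE)

delta = abs(test_jo - ref_jo)      # absolute difference in Jo
nobs_diff = ref_nobs - test_nobs   # extra observations in the reference
print(f"{obs}: delta Jo = {delta:.6e}, reference has {nobs_diff} more observations")
```

The observation counts differ before any cost function is evaluated, which is consistent with the gfs job reading a smaller data dump than the gdas job that produced the reference.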

@DavidNew-NOAA
Contributor Author

Thanks, @RussTreadon-NOAA. I'll look into how to fix this

@RussTreadon-NOAA
Contributor

Installed DavidNew-NOAA:feature/gw-ci at d8929d2 on Hera.

Run the following g-w CI configurations

  • C96C48_hybatmDA - prgsi_pr3120
  • C96C48_ufs_hybatmDA - prjedi_pr3120
  • C96C48_hybatmaerosnowDA - praero_pr3120
  • C48mx500_3DVarAOWCDA - prwcda_pr3120

with the following results

rocotostat /scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/prgsi_pr3120
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202112201800        Done    Nov 26 2024 14:20:16    Nov 26 2024 14:40:17
202112210000        Done    Nov 26 2024 14:20:16    Nov 26 2024 18:30:30
202112210600        Done    Nov 26 2024 14:20:16    Nov 26 2024 17:25:16

rocotostat /scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/prjedi_pr3120
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202402231800        Done    Nov 26 2024 14:20:17    Nov 26 2024 14:40:19
202402240000      Active    Nov 26 2024 14:20:17             -
202402240600      Active    Nov 26 2024 14:20:17             -

rocotostat /scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/praero_pr3120
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202112201200        Done    Nov 26 2024 14:20:19    Nov 26 2024 14:40:21
202112201800        Done    Nov 26 2024 14:20:19    Nov 26 2024 18:40:21
202112210000        Done    Nov 26 2024 14:20:19    Nov 26 2024 17:25:20

rocotostat /scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/prwcda_pr3120
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202103241200        Done    Nov 26 2024 14:20:21    Nov 26 2024 14:40:22
202103241800        Done    Nov 26 2024 14:20:21    Nov 26 2024 15:55:28

Three of the four streams successfully ran all jobs. The C96C48_ufs_hybatmDA (prjedi_pr3120) failure is described above.
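
For anyone scanning several experiments at once, the cycle summaries above can be checked mechanically. The following Python sketch is an illustration only: `unfinished_cycles` is a hypothetical helper that assumes the four-column cycle summary layout pasted above, and it does not call Rocoto itself.

```python
from typing import List

# Cycle summary copied from the prjedi_pr3120 rocotostat output above.
SUMMARY = """\
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202402231800        Done    Nov 26 2024 14:20:17    Nov 26 2024 14:40:19
202402240000      Active    Nov 26 2024 14:20:17             -
202402240600      Active    Nov 26 2024 14:20:17             -
"""


def unfinished_cycles(summary: str) -> List[str]:
    """Return the cycles whose STATE column is not 'Done'."""
    pending = []
    for line in summary.splitlines():
        fields = line.split()
        # Data rows start with a 12-digit cycle (YYYYMMDDHHMM); skip the header.
        if len(fields) >= 2 and fields[0].isdigit() and len(fields[0]) == 12:
            cycle, state = fields[0], fields[1]
            if state != "Done":
                pending.append(cycle)
    return pending


print(unfinished_cycles(SUMMARY))
```

This only flags cycles that have not finished. It says nothing about why, so the job logs are still needed to find a failure such as the reference check described above.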

DavidNew-NOAA added a commit to NOAA-EMC/GDASApp that referenced this pull request Nov 27, 2024
This PR addresses the bug @RussTreadon-NOAA found that
```gfs_atmanlvar``` was being run as a regression test and using the
same test reference as ```gdas_atmanlvar``` in GW PR
[#3120](NOAA-EMC/global-workflow#3120).

See
NOAA-EMC/global-workflow#3120 (comment)

I've moved all activation of testing mode in JCB out of the JCB base
YAMLs and into the JCB algorithm YAMLs. I test the ```RUN``` variable
to make sure it's not equal to ```gfs```.

I re-ran all the regression tests, and they all passed.
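
The gating described in this commit message can be pictured with a short Python sketch. It is illustrative only: the real logic lives in the JCB algorithm YAMLs, and the function `use_test_reference` and its arguments are hypothetical names that mirror the `RUN` variable and the `DO_TEST_MODE` setting.

```python
def use_test_reference(run: str, do_test_mode: str) -> bool:
    """Decide whether a job should be checked against a test reference.

    Hypothetical helper: `run` mirrors the workflow RUN variable and
    `do_test_mode` mirrors the DO_TEST_MODE setting ("YES" or "NO").
    """
    if do_test_mode.upper() != "YES":
        return False
    # The reference was shared with gdas_atmanlvar, so gfs jobs skip the check.
    return run != "gfs"


for run in ("gdas", "enkfgdas", "gfs"):
    print(run, use_test_reference(run, "YES"))
```
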
@DavidNew-NOAA
Contributor Author

GDASApp PR #1390 resolved the bug @RussTreadon-NOAA found and the GDASApp hash has been updated in this PR. This PR is ready for final review.

@RussTreadon-NOAA
Contributor

Hera C96C48_ufs_hybatmDA testing

Update $HOMEgfs to 103de9d. Rebuild sorc/gdas.cd. Rewind failed gfs_atmanlvar. Job successfully ran to completion.

Resume cron running C96C48_ufs_hybatmDA (prjedi_pr3120).

The 20240224 06Z gdas_atmanlvar and enkfgdas_atmensanlobs failed the reference check.

gdas_atmanlvar.log (look in /scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/prjedi_pr3120/logs/2024022406/)

 0: OOPS_STATS Run end                                  - Runtime:    884.49 sec,  Memory: total:   708.56 Gb, per task: min =     7.10 Gb, max =     8.16 Gb
 0: Run: Finishing oops::Variational<FV3JEDI, UFO and IODA observations> with status = 0
 0: terminate called after throwing an instance of 'oops::TestReferenceFloatMismatchError'
 0:   what():  Test reference Float mismatch @ Line:2
 0: Test Val : 3.5027703251845198e+05
 0: Ref  Val : 6.0590797038184456e+05
 0: Delta    : 2.5563093786339258e+05
 0: Relative tolerance: 4.7809250145014829e+02
 0: Absolute tolerance: 1.0000000000000001e-05
 0: Test Line: 'CostJo   : Nonlinear Jo(Aircraft) = 3.5027703251845198e+05, nobs = 255244, Jo/n = 1.3723222975601854e+00, err = 2.6705300233290301e+00'
 0: Ref Line : 'CostJo   : Nonlinear Jo(Aircraft) = 6.0590797038184456e+05, nobs = 504555, Jo/n = 1.2008759607611550e+00, err = 2.2437239435917040e+00'
srun: error: h34m16: task 0: Aborted (core dumped)
srun: Terminating StepId=3120309.0
 0: slurmstepd: error: *** STEP 3120309.0 ON h34m16 CANCELLED AT 2024-11-27T17:26:55 ***

enkfgdas_atmensanlobs.log

 0: OOPS_STATS Run end                                  - Runtime:    378.47 sec,  Memory: total:   664.46 Gb, per task: min =     6.66 Gb, max =     7.70 Gb
 0: Run: Finishing oops::LocalEnsembleDA<FV3JEDI, UFO and IODA observations> with status = 0
 0: terminate called after throwing an instance of 'oops::TestReferenceFloatMismatchError'
 0:   what():  Test reference Float mismatch @ Line:4
 0: Test Val : -8.9534286499023438e+01
 0: Ref  Val : -8.4384330749511719e+01
 0: Delta    : 5.1499557495117188e+00
 0: Relative tolerance: 8.6959308624267581e-02
 0: Absolute tolerance: 1.0000000000000001e-05
 0: Test Line: 'eastward_wind                                | Min:-8.9534286499023438e+01 Max:+1.1134776306152344e+02 RMS:+1.7480398016946587e+01'
 0: Ref Line : 'eastward_wind                                | Min:-8.4384330749511719e+01 Max:+1.1146717834472656e+02 RMS:+1.7423407399233511e+01'
srun: error: h34m16: task 0: Aborted (core dumped)
srun: Terminating StepId=3120009.0
 0: slurmstepd: error: *** STEP 3120009.0 ON h34m16 CANCELLED AT 2024-11-27T17:11:38 ***

These failures make sense. The reference files in $HOMEgfs/sorc/gdas.cd/test/testreference are for the 20240224 00Z cycle. We cannot use the same reference files for every cycle of a cycled g-w CI run. Date- and run-specific reference files are needed.
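
One way to picture date- and run-specific references is a lookup keyed on run, job, and cycle. The Python sketch below is hypothetical throughout: the table, the file names, and `reference_for` are invented for illustration and do not reflect the actual contents of `test/testreference`.

```python
from datetime import datetime
from typing import Optional

# Hypothetical table: (RUN, job, cycle) -> reference file name.
# The file names are made up; only the 2024-02-24 00Z cycle has entries here.
REFERENCES = {
    ("gdas", "atmanlvar", datetime(2024, 2, 24, 0)): "gdas_atmanlvar_2024022400.ref",
    ("enkfgdas", "atmensanlobs", datetime(2024, 2, 24, 0)): "enkfgdas_atmensanlobs_2024022400.ref",
}


def reference_for(run: str, job: str, cycle: datetime) -> Optional[str]:
    """Return the reference file for this run, job, and cycle, or None to skip the check."""
    return REFERENCES.get((run, job, cycle))


# The 00Z gdas job is checked; the 06Z job has no reference and is skipped.
print(reference_for("gdas", "atmanlvar", datetime(2024, 2, 24, 0)))
print(reference_for("gdas", "atmanlvar", datetime(2024, 2, 24, 6)))
```

Returning `None` for an unknown combination means a job without a matching reference runs unchecked instead of being compared against output from a different cycle or run.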

@DavidNew-NOAA
Contributor Author

@RussTreadon-NOAA I updated the JCB algorithm yamls in https://github.com/NOAA-EMC/GDASApp/tree/bugfix/gw-ci to only use test references at the appropriate cycle time. I updated the GDAS hash here to point to that branch. Can you retry the testing?

@RussTreadon-NOAA
Contributor

@DavidNew-NOAA , my working copy of DavidNew-NOAA:feature/gw-ci has been updated to bring in https://github.com/NOAA-EMC/GDASApp/tree/bugfix/gw-ci at b850890

The DEAD prjedi_pr3120 jobs were rewound and rebooted. The 20240224 06Z gdas_atmanlvar and enkfgdas_atmensanlobs jobs successfully ran to completion.

I will let the prjedi_pr3120 run to completion. After this I will make a clean run of C96C48_ufs_hybatmDA from start to finish.

@RussTreadon-NOAA
Contributor

g-w CI C96C48_ufs_hybatmDA

rocotostat /scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/prjedi_pr3120
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202402231800        Done    Nov 26 2024 14:20:17    Nov 26 2024 14:40:19
202402240000        Done    Nov 26 2024 14:20:17    Nov 28 2024 00:45:23
202402240600        Done    Nov 26 2024 14:20:17    Nov 28 2024 00:08:29

Contributor

@RussTreadon-NOAA RussTreadon-NOAA left a comment

Install DavidNew-NOAA:feature/gw-ci at 4af258c on Hera. The following g-w CI successfully ran all jobs

/scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/C48_ATM_pr3120
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202103231200        Done    Nov 28 2024 13:40:15    Nov 28 2024 15:00:26
202103231800        Done    Nov 28 2024 13:40:15    Nov 28 2024 15:10:15

/scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/C48_S2SW_pr3120
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202103231200        Done    Nov 28 2024 13:40:22    Nov 28 2024 17:09:49
202103231800        Done    Nov 28 2024 13:40:22    Nov 28 2024 17:18:48

/scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/C48mx500_3DVarAOWCDA_pr3120
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202103241200        Done    Nov 28 2024 13:40:23    Nov 28 2024 14:00:41
202103241800        Done    Nov 28 2024 13:40:23    Nov 28 2024 14:55:49

/scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/C96C48_hybatmDA_pr3120
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202112201800        Done    Nov 28 2024 13:40:25    Nov 28 2024 14:00:43
202112210000        Done    Nov 28 2024 13:40:25    Nov 28 2024 16:15:30
202112210600        Done    Nov 28 2024 13:40:25    Nov 28 2024 16:00:38

/scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/C96C48_hybatmaerosnowDA_pr3120
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202112201200        Done    Nov 28 2024 13:40:26    Nov 28 2024 14:05:28
202112201800        Done    Nov 28 2024 13:40:26    Nov 28 2024 16:20:28
202112210000        Done    Nov 28 2024 13:40:26    Nov 28 2024 16:00:39

/scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/C96C48_ufs_hybatmDA_pr3120
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202402231800        Done    Nov 28 2024 13:40:28    Nov 28 2024 14:00:47
202402240000        Done    Nov 28 2024 13:40:28    Nov 28 2024 16:50:28
202402240600        Done    Nov 28 2024 13:40:28    Nov 28 2024 16:40:28

/scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/C96_atm3DVar_pr3120
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202112201800        Done    Nov 28 2024 13:40:31    Nov 28 2024 14:00:51
202112210000        Done    Nov 28 2024 13:40:31    Nov 28 2024 16:15:37
202112210600        Done    Nov 28 2024 13:40:31    Nov 28 2024 15:50:30

The following gefs g-w CI jobs died

/scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/C48_S2SWA_gefs_pr3120
202103231200    gefs_fcst_mem000_seg0                     3151343                DEAD                 153         2         104.0
202103231200    gefs_fcst_mem001_seg0                     3151344                DEAD                 153         2         116.0
202103231200    gefs_fcst_mem002_seg0                     3151345                DEAD                 153         2         116.0

and

/scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/C96_S2SWA_gefs_replay_ics_pr3120
202011010000    gefs_fcst_mem000_seg0                     3151354                DEAD                 174         2         117.0
202011010000    gefs_fcst_mem001_seg0                     3151355                DEAD                 174         2         147.0
202011010000    gefs_fcst_mem002_seg0                     3151356                DEAD                 174         2         132.0

The C48_S2SWA_gefs forecasts died with

 24:        Wave model ...
 0:  zeroing coupling accumulated fields at kdt=            1
 0:  zeroing coupling accumulated fields at kdt=            1
26: forrtl: severe (153): allocatable array or pointer is not allocated
26: Image              PC                Routine            Line        Source
26: ufs_model.x        00000000063BE842  Unknown               Unknown  Unknown
26: ufs_model.x        0000000001E6BBA8  pdlib_w3profsmd_m        7349  w3profsmd_pdlib.F90
26: ufs_model.x        0000000001BCA1D7  w3initmd_mp_w3ini        1244  w3initmd.F90
26: ufs_model.x        0000000001AE2CF7  wav_comp_nuopc_mp        1669  wav_comp_nuopc.F90
26: ufs_model.x        0000000000A9D754  Unknown               Unknown  Unknown

The C96_S2SWA_gefs_replay_ics forecast died with

132:        Wave model ...
136: [h24c41:2969163:0:2969163] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4)
138: [h24c41:2969165:0:2969165] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4)
136: forrtl: severe (174): SIGSEGV, segmentation fault occurred
136: Image              PC                Routine            Line        Source
136: ufs_model.x        0000000006379DEA  Unknown               Unknown  Unknown
136: libpthread-2.28.s  000014EC77722D10  Unknown               Unknown  Unknown
136: ufs_model.x        0000000001DF769E  pdlib_field_vec_m         501  pdlib_field_vec.F90
136: ufs_model.x        0000000001C80E06  w3iorsmd_mp_w3ior         802  w3iorsmd.F90
136: ufs_model.x        0000000001BC626B  w3initmd_mp_w3ini         961  w3initmd.F90
136: ufs_model.x        0000000001AE2CF7  wav_comp_nuopc_mp        1669  wav_comp_nuopc.F90
136: ufs_model.x        0000000000A9D754  Unknown               Unknown  Unknown

This PR does not alter gefs. Not sure if we expect these jobs to successfully run on Hera.

DA g-w CI passes on Hera. Approve.
