Use MPI_Bcast instead of multiple p2p messages to update nest from parent #272
Conversation
model/fv_control.F90
Outdated
@@ -452,6 +456,9 @@ subroutine fv_control_init(Atm, dt_atmos, this_grid, grids_on_this_pe, p_split,
     allocate(global_pelist(npes))
     call mpp_get_current_pelist(global_pelist, commID=global_commID) ! for commID

+    ! Need the fcst_mpi_comm in order to construct the communicators used by MPI_Bcast in fill_nested_grid_cpl()
+    call ESMF_VMGetCurrent(vm=vm,rc=rc)
Thank you for bringing attention to this problem, and I'm glad to see that there is a solution. This idea could be useful for a lot of the tasks involved in creating and filling grids, especially nests.
I believe there should be FMS calls that will allow us to do the communication directly between processors instead of using low-level MPI routines or ESMF calls, which are not part of FV3. @bensonr or @laurenchilutti will know for sure; if we don't have these, please open an issue in FMS.
FMS does have a similar call (mpp_broadcast), but, at least the way I read the code, it creates the communicator on the fly every time it's called. My testing shows that the creation of the communicator takes about the same amount of time as the actual Bcast; that's why I create the communicator in fv_control_init(). Here are some numbers from an offline reproducer run on acorn. The offline code creates the MPI communicator every iteration. The first run takes a long time (in cycles), probably due to MPI initialization. "setup subcomm" refers to the call to MPI_Comm_split().
Cycles to setup subcomm: 845895398
Cycles Bcast with a comm: 352677465
Subsequent calls within the same execution settle into a nice pattern.
Cycles to setup subcomm: 35054707
Cycles Bcast with a comm: 3219772
Cycles to setup subcomm: 1966522
Cycles Bcast with a comm: 3727102
Cycles to setup subcomm: 2671380
Cycles Bcast with a comm: 3211785
Cycles to setup subcomm: 1973183
Cycles Bcast with a comm: 3195247
Cycles to setup subcomm: 1884240
Cycles Bcast with a comm: 3205283
Cycles to setup subcomm: 2135385
Cycles Bcast with a comm: 3193492
Cycles to setup subcomm: 2105010
Cycles Bcast with a comm: 3203685
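For reference, a minimal sketch of this kind of reproducer (not the actual tester from acorn; the rank range, buffer size, and iteration count are illustrative, and MPI_Wtime is used in place of a cycle counter):

program bcast_reproducer_sketch
  ! Hypothetical reproducer along the lines described above: every iteration
  ! creates the sub-communicator with MPI_Comm_split and then times a single
  ! MPI_Bcast over it.
  use mpi
  implicit none
  integer :: ierr, world_rank, color, sub_comm, n
  double precision :: t_split, t_bcast
  real, allocatable :: g_dat(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)

  allocate(g_dat(1584000))   ! ~6.3 MB of 32-bit reals, comparable to the SST field
  g_dat = 0.0

  do n = 1, 8
    color = MPI_UNDEFINED
    if (world_rank < 288) color = 1     ! example: first 288 ranks form the Bcast group

    t_split = MPI_Wtime()
    call MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, sub_comm, ierr)
    t_split = MPI_Wtime() - t_split

    t_bcast = 0.0d0
    if (sub_comm /= MPI_COMM_NULL) then
      t_bcast = MPI_Wtime()
      call MPI_Bcast(g_dat, size(g_dat), MPI_REAL, 0, sub_comm, ierr)
      t_bcast = MPI_Wtime() - t_bcast
      call MPI_Comm_free(sub_comm, ierr)
    end if

    if (world_rank == 0) then
      write(6,'(a,f14.8)') 'Time to setup subcomm (s): ', t_split
      write(6,'(a,f14.8)') 'Time Bcast with a comm (s): ', t_bcast
    end if
  end do

  call MPI_Finalize(ierr)
end program bcast_reproducer_sketch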
@dkokron - please send me your test program so I can confirm your findings with mpp_broadcast. As I look at the code, if the pelist has been defined previously, it should be using a predefined communicator. A new communicator is defined only when a new, unique pelist is encountered.
It's on cactus; I will upload it when cactus comes back from dedicated time.
Perhaps then it's enough to create a pelist with the appropriate PEs in fv_control_init() instead of an MPI_Comm.
Thanks for letting me know about the cactus downtime.
There should be no need to create the pelist in fv_control. It can be defined/used within the fill_nested_grid_cpl function. The call to get_peset here will execute this code in mpp_util_mpi.inc. If the pelist is recognized, it will return the mpp-managed internal list location and mpp_broadcast will use the communicator associated with that list entry. If the pelist isn't recognized, it will add it to the linked list of known pelists and create a communicator via MPI_Comm_create_group. If you want to look over the code with me, let me know and we can schedule it.
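To make the caching behavior described above concrete, here is a small hedged sketch (the PE numbers and array shape are illustrative, not taken from the actual HAFS configuration):

program peset_cache_sketch
  ! Hedged sketch of the behavior described above. The PE numbers (parent-tile
  ! root 0, nest PEs 448..735) and the array shape are illustrative.
  use mpp_mod, only: mpp_init, mpp_exit, mpp_broadcast, mpp_pe
  implicit none
  integer :: bcast_pelist(289), n
  real, allocatable :: g_dat(:,:,:)

  call mpp_init()

  bcast_pelist(1)  = 0                         ! sending parent-tile root PE
  bcast_pelist(2:) = (/ (n, n = 448, 735) /)   ! nest PEs

  allocate(g_dat(1320,1200,1))
  g_dat = 0.0

  if (any(mpp_pe() == bcast_pelist)) then
    ! First call: get_peset does not recognize this pelist, so it creates a
    ! communicator (via MPI_Comm_create_group) and stores it in the peset table.
    call mpp_broadcast(g_dat, size(g_dat), bcast_pelist(1), bcast_pelist)
    ! Second call with the same pelist: get_peset returns the stored entry and
    ! mpp_broadcast reuses its communicator; no new communicator is created.
    call mpp_broadcast(g_dat, size(g_dat), bcast_pelist(1), bcast_pelist)
  end if

  call mpp_exit()
end program peset_cache_sketch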
I see your point about being able to use get_peset() to create the communicator once, storing it in the peset structure, so that mpp_broadcast() then reuses that communicator. My implementation does essentially the same thing while storing the communicator in the fv_atmos_type structure.
I'm a little worried about the extra work done in the MPP approach: every call to get_peset() will check for monotonicity and check for a match against an existing peset. I'll see if I can code up the MPP approach in my unit tester and determine whether my worry is justified.
My worry about the extra work was unfounded. I will re-code the PR using mpp_broadcast.
You can find my unit testers under clogin02:/u/daniel.kokron/Bcast/
The testers simulate what happens in a HAFS case with one nest run on 736 ranks.
Some timings:
grep Cycles Bcast.o58131230
Using MPI_Bcast with a pre-made communicator (C implementation)
Cycles to setup subcomm: 789298313
Cycles Bcast with a comm: 350650485
Cycles Bcast with a comm: 96031373
Cycles Bcast with a comm: 3138885
Cycles Bcast with a comm: 3160485
Cycles Bcast with a comm: 3150698
Cycles Bcast with a comm: 3127387
Cycles Bcast with a comm: 3128850
Cycles Bcast with a comm: 3114293
Cycles Bcast with a comm: 3123540
Cycles Bcast with a comm: 3147165
Cycles Bcast with a comm: 3136793
Cycles Bcast with a comm: 3124845
Cycles Bcast with a comm: 3137737
Cycles Bcast with a comm: 3132450
Cycles Bcast with a comm: 3156840
Using MPI_Bcast with a pre-made communicator (Fortran implementation)
Cycles to setup subcomm 136254240
Cycles rank/thread 332684775
Cycles rank/thread 4624537
Cycles rank/thread 3276248
Cycles rank/thread 3257415
Cycles rank/thread 3283357
Cycles rank/thread 3271545
Cycles rank/thread 3424860
Cycles rank/thread 3280073
Cycles rank/thread 3279398
Cycles rank/thread 3288915
Cycles rank/thread 3359723
Cycles rank/thread 3296475
Cycles rank/thread 3246457
Cycles rank/thread 3312000
Cycles rank/thread 3278633
Using mpp_broadcast()
Cycles rank/thread 548630797
Cycles rank/thread 3106642
Cycles rank/thread 3273503
Cycles rank/thread 3308108
Cycles rank/thread 3264412
Cycles rank/thread 3284077
Cycles rank/thread 3473212
Cycles rank/thread 3260768
Cycles rank/thread 3294653
Cycles rank/thread 3311325
Cycles rank/thread 3297983
Cycles rank/thread 3296722
Cycles rank/thread 3306937
Cycles rank/thread 3325117
Cycles rank/thread 3270375
@dkokron - I appreciate your willingness to work with us on this.
Never easy.
Swapping in mpp_broadcast in place of MPI_Bcast in fill_nested_grid_cpl() results in runtime failure.
subroutine fill_nested_grid_cpl(this_grid, proc_in)
- use mpi
+ !use mpi
+ use mpp_mod, only: mpp_broadcast
integer, intent(in) :: this_grid
logical, intent(in), optional :: proc_in
@@ -2460,7 +2461,8 @@ contains
g_dat(isg:,jsg:,1), position=CENTER)
endif
if(any(mpp_pe() == Atm(this_grid)%Bcast_ranks)) then
- call MPI_Bcast(g_dat, size(g_dat), MPI_REAL, 0, Atm(this_grid)%Bcast_comm, ierr) ! root==0 because sending rank (Atm(this_grid)%sending) is rank zero in Bcast_comm
+ !call MPI_Bcast(g_dat, size(g_dat), MPI_REAL, 0, Atm(this_grid)%Bcast_comm, ierr) ! root==0 because sending rank (Atm(this_grid)%sending) is rank zero in Bcast_comm
+ call mpp_broadcast(g_dat, size(g_dat),Atm(this_grid)%sending_proc,Atm(this_grid)%Bcast_ranks)
endif
call timing_off('COMM_TOTAL')
if (process) then
The stack trace points to a part of the code I didn't change.
nid001001.acorn.wcoss2.ncep.noaa.gov 0: NOTE from PE 0: eval_move_nest: move_cd_x= 0 move_cd_y= 0 do_move= F
nid001001.acorn.wcoss2.ncep.noaa.gov 0: NOTE from PE 0: Opened BC file: INPUT/gfs_bndy.tile7.003.nc
nid001001.acorn.wcoss2.ncep.noaa.gov 0: forrtl: error (78): process killed (SIGTERM)
nid001001.acorn.wcoss2.ncep.noaa.gov 0: Image PC Routine Line Source
nid001001.acorn.wcoss2.ncep.noaa.gov 0: fv3_32bit.exe 000000000258B397 fv_regional_mod_m 3387 fv_regional_bc.F90
nid001001.acorn.wcoss2.ncep.noaa.gov 0: fv3_32bit.exe 000000000257BCD5 fv_regional_mod_m 1792 fv_regional_bc.F90
nid001001.acorn.wcoss2.ncep.noaa.gov 0: fv3_32bit.exe 000000000259E347 fv_regional_mod_m 1592 fv_regional_bc.F90
nid001001.acorn.wcoss2.ncep.noaa.gov 0: fv3_32bit.exe 000000000291AABA atmosphere_mod_mp 673 atmosphere.F90
Hi. Sorry for the delay. This is a bit odd and might be a sign of a memory
bug somewhere. How soon in the run does this error trigger? Also, does
running in debug mode raise any errors?
Thanks,
Lucas
I was able to get it working with mpp_broadcast() by playing games with mpp_set_current_pelist(). I'm running the regression suite now and will redo some of the performance testing to make sure it still helps.
That is great. Thanks for continuing to work on this.
Lucas
I gathered more timings and updated the table. Using MPI_Bcast directly is significantly faster than using mpp_broadcast(). Please review the code and let me know if I'm using MPP incorrectly.
@dkokron - what timings did you get using mpp_broadcast for comparison with the data in the PR description?
Scenario 3.
The underlying code in the PR now uses mpp_broadcast().
Given the significant difference in timings between scenarios 3 (mpp_broadcast) and 4 (MPI_Bcast), which proposal should we move forward with?
What do you have in mind?
Let me know if you need me to test anything.
…On Wed, May 31, 2023 at 12:32 PM lharris4 wrote:
> Hello @dkokron. @bensonr and I are discussing how to move forward. There may be an underlying problem in the way the SST is being gathered in the first place that isn't addressed by this fix.
@dkokron - I have a test case that creates a 6-tile mosaic with a nest. I'd like to run your exact sizes to get some timings on gaea. What core counts and nest configuration are you using, so I can have a good comparison?
The 26-node case is on acorn
/lfs/h1/hpc/support/daniel.kokron/HAFS/hafsv1_final/T2O_2020092200_17L
T2O_2020092200_17L > ll *.nml
-rw-r--r-- 1 daniel.kokron support 8979 Mar 28 22:24 input720_nest02.nml
-rw-r--r-- 1 daniel.kokron support 9416 Mar 28 22:24 input720.nml
-rw-r--r-- 1 daniel.kokron support 9016 Apr 12 21:58 input736_nest02.nml
-rw-r--r-- 1 daniel.kokron support 9453 Apr 19 16:55 input736.nml
lrwxrwxrwx 1 daniel.kokron support 19 Mar 29 15:41 input_nest02.nml ->
input736_nest02.nml
lrwxrwxrwx 1 daniel.kokron support 12 Mar 29 15:41 input.nml ->
input736.nml
The production case is on cactus.
/u/daniel.kokron/Scratch/HAFS/hafsv1_merge_hfsb/2021082712/09L/forecast
I gave up my access to NCEP/NCO when WCOSS2 came online and the rules and restrictions became too much for me to maintain it. I'd appreciate it if you could give me the domain sizes (npx, npy, npz) and layouts for the global and nest grids, as well as the nest_nml info for the two cases you are interested in - thx
26-node case
Parent grid
layout = 16,28 ! LongEW
io_layout = 1,1
npx = 1321
npy = 1201
ntiles = 1
npz = 81
grid_pes = 448,288
Nest grid
layout = 12,24 ! LongEW
io_layout = 1,1
npx = 601
npy = 601
ntiles = 1
npz = 81
47-node case
------------------
Parent grid
layout = 36,50
io_layout = 1,1
npx = 1201
npy = 1201
ntiles = 1
npz = 81
grid_pes = 1800,840
Nest grid
layout = 20,42
io_layout = 1,1
npx = 601
npy = 601
ntiles = 1
npz = 81
Thanks for passing all of this along. I finally have the prototype test code running where I create a top-level domain with a nest. Using your data for the 47-node case (21 nodes on gaea with no OMP threads), I have generated data for both 32-bit and 64-bit quantities. The sample sizes were 25 and 50, respectively. The first iteration for each was a touch longer as they set up the pelist and communicator.
For 64-bit data, the max average time for:
global gather on 1800 parent ranks is: 0.16s
broadcast from 1 parent rank to 840 nest ranks is: 0.058s
For 32-bit data, the max average time for:
global gather on 1800 parent ranks is: 0.109s
broadcast from 1 parent rank to 840 nest ranks is: 0.029s
In your code you are swapping scope via the mpp_set_current_pelist call. There is no need for the scope changes when using explicit pelists for mpp communication.
do n = 1, 50
   t_dum = mpp_get_tile_id(domain_coarse)
   if (((t_dum(1) == tile_coarse(1)) .and. mpp_domain_is_tile_root_pe(domain_coarse)) .or. (is_fine_pe)) then
      time_it = MPI_Wtime()
      bcast_npes = mpp_get_domain_npes(domain_fine) + 1
      allocate(bcast_pelist(bcast_npes))
      bcast_pelist(1) = mpp_get_domain_tile_root_pe(domain_coarse)
      call mpp_get_pelist(domain_fine, bcast_pelist(2:bcast_npes))
      call mpp_broadcast(xg, size(xg), bcast_pelist(1), bcast_pelist)
      time_it = MPI_Wtime() - time_it
      t_max = time_it
      t_min = time_it
      call mpp_max(t_max, bcast_pelist)
      call mpp_min(t_min, bcast_pelist)
      if (mpp_pe() == bcast_pelist(1)) write(6,'(a,f10.6,2x,f10.6)') ' time for bcast ', t_min, t_max
      deallocate(bcast_pelist)
   endif
enddo
In the code above, all MPI ranks enter the do-loop, but only those participating in the broadcast enter the if-test. The above code is not 100% generic, though you can simplify it immensely for use in fill_nested_grid_cpl (see the sketch after this comment).
All MPICH parameters were the defaults as defined by HPE/Cray; no MPICH tunables were utilized for this test.
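A hedged sketch of how that simplification might look for a single broadcast in fill_nested_grid_cpl (illustrative only, not the final PR code; the subroutine wrapper and module attributions are assumptions, and the arguments carry the same meaning as in the snippet above):

subroutine fill_nest_bcast_sketch(domain_coarse, domain_fine, tile_coarse, is_fine_pe, g_dat)
  ! Illustrative simplification of the timing loop above: one broadcast,
  ! no instrumentation. Wrapper and module attributions are assumptions.
  use mpp_mod,         only: mpp_pe, mpp_broadcast
  use mpp_domains_mod, only: domain2d, mpp_get_tile_id, mpp_domain_is_tile_root_pe, &
                             mpp_get_domain_npes, mpp_get_domain_tile_root_pe, mpp_get_pelist
  implicit none
  type(domain2d), intent(in)    :: domain_coarse, domain_fine
  integer,        intent(in)    :: tile_coarse(:)
  logical,        intent(in)    :: is_fine_pe
  real,           intent(inout) :: g_dat(:,:,:)
  integer :: bcast_npes
  integer, allocatable :: bcast_pelist(:), t_dum(:)

  t_dum = mpp_get_tile_id(domain_coarse)
  if (((t_dum(1) == tile_coarse(1)) .and. mpp_domain_is_tile_root_pe(domain_coarse)) .or. is_fine_pe) then
     bcast_npes = mpp_get_domain_npes(domain_fine) + 1
     allocate(bcast_pelist(bcast_npes))
     bcast_pelist(1) = mpp_get_domain_tile_root_pe(domain_coarse)   ! broadcast root: parent-tile root PE
     call mpp_get_pelist(domain_fine, bcast_pelist(2:bcast_npes))   ! nest PEs
     call mpp_broadcast(g_dat, size(g_dat), bcast_pelist(1), bcast_pelist)
     deallocate(bcast_pelist)
  endif
end subroutine fill_nest_bcast_sketch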
Rusty,
I was getting runtime errors without the scoping. I think I see my error.
I am starting my bcast_npes indexing at zero not 1. More testing.
Dan
Rusty,
Please send me your complete reproducer.
Sure, I'll share it via your email.
***@***.*** or
***@***.***
Unfortunately, GitHub blocked the addresses you just included. I was planning to send to your NOAA account, if that's okay with you.
oops ***@***.*** (not gov)
NOAA account works too.
This is what I get when using mpp_broadcast() in HAFS without the scoping.
I really cannot imagine how that -448 is getting in there.
nid001022.acorn.wcoss2.ncep.noaa.gov 724: fill_nested_grid_cpl:
724
nid001022.acorn.wcoss2.ncep.noaa.gov 724: MPICH ERROR [Rank 724] [job id
65bd0239-958a-4cad-bd26-b2808f73b8ba] [Fri Jun 2 19:53:29 2023]
[nid001022] - Abort(940145670) (rank 724 in comm 0): Fatal error in
PMPI_Group_incl: Invalid rank, error stack:
nid001022.acorn.wcoss2.ncep.noaa.gov 724:
PMPI_Group_incl(170).............: MPI_Group_incl(group=0xc8000006, n=289,
ranks=0x7fffc9caf880, new_group=0x14713ad803f8) failed
nid001022.acorn.wcoss2.ncep.noaa.gov 724:
MPIR_Group_check_valid_ranks(244): Invalid rank in rank array at index 0;
value is -448 but must be in the range 0 to 287
nid001022.acorn.wcoss2.ncep.noaa.gov 724:
nid001022.acorn.wcoss2.ncep.noaa.gov 724: aborting job:
if (any(mpp_pe() == Atm(this_grid)%Bcast_ranks)) then
   print*, 'fill_nested_grid_cpl: ', mpp_pe()
   call mpp_broadcast(g_dat, size(g_dat), Atm(this_grid)%Bcast_ranks(1), Atm(this_grid)%Bcast_ranks)
endif
Submitted NOAA-GFDL/FMS#1246 to address the failure seen above when using mpp_broadcast(). Please consider using the pure MPI approach to address this performance issue.
@junwang-noaa We are not using FMS for I/O anymore; we switched to ESMF multi-tile support for I/O. This will be available once we merge ufs-community/ufs-weather-model#1845.
@junwang-noaa - I know this is a little off topic for the dycore repo here... While you are working through that list, you still have the ability to compile in the deprecated I/O modules via the -Duse_deprecated_io flag (see the FMS release notes). We test MOM6 as part of our FMS release testing, and there are some updates from the main trunk that will need to be pulled into the ufs branch to resolve any issues.
Rusty, thanks for the information!
As far as GFDL_atmos_cubed_sphere is concerned, is this PR fit to move forward?
As long as FMS is compiled with the flag -Duse_deprecated_io, this PR is fit to move forward. @jkbk2004, as UFS code manager, when do you think we can have this PR merged?
We are working on the spack-stack PR. For FMS-2023.02, the JEDI team needs to use -Duse_deprecated_io. A bit of coordination and testing is needed to confirm the test results. I think we can schedule the commit for next week or so, but I will keep checking as testing goes on.
In that case, should I extract the FMS-2023.02 porting modifications from my branch?
I see no harm in keeping these changes in.
model/fv_arrays.F90
Outdated
@@ -1843,6 +1848,7 @@ end subroutine allocate_fv_atmos_type

 !>@brief The subroutine 'deallocate_fv_atmos_type' deallocates the fv_atmos_type.
 subroutine deallocate_fv_atmos_type(Atm)
+  use mpi
Is this still needed?
Not needed anymore since we are using mpp_broadcast() now instead of MPI_Bcast().
Would like to see this use statement removed prior to accepting.
Removed unneeded use of MPI module in model/fv_arrays.F90
Rusty,
I'm on PTO until 3 Oct. I won't have GitHub access until then.
@bensonr are there any more changes that you would like to see in this PR?
The FMS modification that supports this PR was merged. See details at NOAA-GFDL/FMS#1246.
@dkokron - at the bi-weekly UFS code managers meeting we have been telling them this is ready to merge - they have not yet blessed it.
The UFS code managers will be ready to merge this change in the coming weeks. @dkokron, could you create an fv3atm PR and a UFS-weather-model PR for this change?
@laurenchilutti Will do.
Opened PR against fv3atm
Opened PR with UFS
…stedGridCpl Using MPI broadcast instead of multiple point-to-point messages to improve performance
Testing on #2059 has completed successfully; please proceed with the merge process for this PR.
Copying in @laurenchilutti as well for this request.
@bensonr can you merge this PR?
Performance profiling of a HAFS case on NOAA systems revealed that a significant amount of time was spent in fill_nested_grid_cpl(). The fill_nested_grid_cpl() routine from FV3/atmos_cubed_sphere/driver/fvGFS/atmosphere.F90 shows up as a performance bottleneck. This routine gathers a global SST field (6,336,000 bytes) onto rank 0, then broadcasts that field to all ranks in the nest. The code uses point-to-point (p2p) messages (Isend/Recv) from rank 0 to the nest ranks. This communication pattern maxes out the Slingshot-10 link on the first node, resulting in a 0.15 s hit every fifth time step, which works out to 189 seconds for a full 5-day simulation. These timings are from the 26-node HAFS case on acorn.
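For illustration, a minimal sketch of the two communication patterns being compared; the names (nest_ranks, bcast_comm, use_bcast, g_dat, i_am_sender) are placeholders, not the actual FV3 symbols, and the matching Recv posted by each nest rank in the p2p pattern is not shown:

subroutine update_nest_sketch(g_dat, nest_ranks, bcast_comm, i_am_sender, use_bcast)
  ! Illustrative contrast of the two patterns discussed in this PR.
  use mpi
  implicit none
  real,    intent(inout) :: g_dat(:)      ! e.g. the gathered global SST field
  integer, intent(in)    :: nest_ranks(:) ! world ranks of the nest PEs
  integer, intent(in)    :: bcast_comm    ! pre-built comm; the sending rank is rank 0 in it
  logical, intent(in)    :: i_am_sender, use_bcast
  integer :: n, ierr
  integer :: req(size(nest_ranks))

  if (.not. use_bcast) then
    ! Original pattern: rank 0 posts one Isend per nest rank.
    if (i_am_sender) then
      do n = 1, size(nest_ranks)
        call MPI_Isend(g_dat, size(g_dat), MPI_REAL, nest_ranks(n), 0, &
                       MPI_COMM_WORLD, req(n), ierr)
      end do
      call MPI_Waitall(size(nest_ranks), req, MPI_STATUSES_IGNORE, ierr)
    end if
  else
    ! Pattern proposed here: a single collective over the sender + nest ranks.
    call MPI_Bcast(g_dat, size(g_dat), MPI_REAL, 0, bcast_comm, ierr)
  end if
end subroutine update_nest_sketch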
Given the way it is currently coded, I expect the bottleneck to grow with the number of ranks in the nest. Timings were gathered from a 47-node production case run for a 24-hour simulation on cactus, with and without the code changes contained in this PR.
An offline reproducer showed that setting the following environment variables improved the performance of using MPI_Bcast in this scenario.
export MPICH_BCAST_ONLY_TREE=0
export MPICH_COLL_OPT_OFF=mpi_bcast
Scenarios:
Performance metric:
Add up the phase1 and phase2 timings printed in the output listing
grep PASS stdout | awk '{t+=$10;print t}' | tail -1
The units are seconds.
Based on these timings, I expect savings from scenario 3 (this PR) of ~(26s)*5=131 seconds for a full 5-day simulation.
Based on these timings, I expect savings from scenario 4 (previous version of this PR) of ~(66s)*5=332 seconds for a full 5-day simulation.
This PR does not address any open issues.
I ran the UFS regression suite on acorn and cactus. Both runs resulted in "REGRESSION TEST WAS SUCCESSFUL"
Checklist:
Please check all items, whether they apply or not