Use MPI_Bcast instead of multiple p2p messages to update nest from parent #272
Conversation
model/fv_control.F90
Outdated
@@ -452,6 +456,9 @@ subroutine fv_control_init(Atm, dt_atmos, this_grid, grids_on_this_pe, p_split,
     allocate(global_pelist(npes))
     call mpp_get_current_pelist(global_pelist, commID=global_commID) ! for commID

+    ! Need the fcst_mpi_comm in order to construct the communicators used by MPI_Bcast in fill_nested_grid_cpl()
+    call ESMF_VMGetCurrent(vm=vm,rc=rc)
Thank you for bringing attention to this problem, and I'm glad to see that there is a solution. This idea could be useful for a lot of the tasks involved in creating and filling grids, especially nests.
I believe there should be FMS calls that will allow us to do the communication directly between processors instead of using low-level MPI routines or ESMF calls, which are not part of FV3. @bensonr or @laurenchilutti will know for sure; if we don't have these, please open an issue in FMS.
FMS does have a similar call (mpp_broadcast), but, at least the way I read the code, it creates the communicator on the fly every time it's called. My testing shows that the creation of the communicator takes about the same amount of time as the actual Bcast; that's why I create the communicator in fv_control_init(). Here are some numbers from an offline reproducer run on acorn. The offline code creates the MPI communicator every iteration. The first run takes a long time (in cycles), probably due to MPI initialization. "setup subcomm" refers to the call to MPI_Comm_split().
Cycles to setup subcomm: 845895398
Cycles Bcast with a comm: 352677465
Subsequent calls within the same execution settle into a nice pattern.
Cycles to setup subcomm: 35054707
Cycles Bcast with a comm: 3219772
Cycles to setup subcomm: 1966522
Cycles Bcast with a comm: 3727102
Cycles to setup subcomm: 2671380
Cycles Bcast with a comm: 3211785
Cycles to setup subcomm: 1973183
Cycles Bcast with a comm: 3195247
Cycles to setup subcomm: 1884240
Cycles Bcast with a comm: 3205283
Cycles to setup subcomm: 2135385
Cycles Bcast with a comm: 3193492
Cycles to setup subcomm: 2105010
Cycles Bcast with a comm: 3203685
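For reference, a minimal sketch of this kind of reproducer (not the actual tester from acorn; the rank range, buffer size, and iteration count are illustrative, and MPI_Wtime is used in place of a cycle counter):

program bcast_reproducer_sketch
  ! Hypothetical reproducer along the lines described above: every iteration
  ! creates the sub-communicator with MPI_Comm_split and then times a single
  ! MPI_Bcast over it.
  use mpi
  implicit none
  integer :: ierr, world_rank, color, sub_comm, n
  double precision :: t_split, t_bcast
  real, allocatable :: g_dat(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)

  allocate(g_dat(1584000))   ! ~6.3 MB of 32-bit reals, comparable to the SST field
  g_dat = 0.0

  do n = 1, 8
    color = MPI_UNDEFINED
    if (world_rank < 288) color = 1     ! example: first 288 ranks form the Bcast group

    t_split = MPI_Wtime()
    call MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, sub_comm, ierr)
    t_split = MPI_Wtime() - t_split

    t_bcast = 0.0d0
    if (sub_comm /= MPI_COMM_NULL) then
      t_bcast = MPI_Wtime()
      call MPI_Bcast(g_dat, size(g_dat), MPI_REAL, 0, sub_comm, ierr)
      t_bcast = MPI_Wtime() - t_bcast
      call MPI_Comm_free(sub_comm, ierr)
    end if

    if (world_rank == 0) then
      write(6,'(a,f14.8)') 'Time to setup subcomm (s): ', t_split
      write(6,'(a,f14.8)') 'Time Bcast with a comm (s): ', t_bcast
    end if
  end do

  call MPI_Finalize(ierr)
end program bcast_reproducer_sketch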
@dkokron - please send me your test program so I can confirm your findings with mpp_broadcast. As I look at the code, if the pelist has been defined previously, it should be using a predefined communicator. A new communicator is defined only when a new, unique pelist is encountered.
It's on cactus; I will upload it when cactus comes back from dedicated time.
Perhaps then it's enough to create a pelist with the appropriate PEs in fv_control_init() instead of an MPI_Comm.
Thanks for letting me know about the cactus downtime.
There should be no need to create the pelist in fv_control. It can be defined/used within the fill_nested_grid_cpl function. The call to get_peset here will execute this code in mpp_util_mpi.inc. If the pelist is recognized, it will return the mpp-managed internal list location and mpp_broadcast will use the communicator associated with that list entry. If the pelist isn't recognized, it will add it to the linked list of known pelists and create a communicator via MPI_Comm_create_group. If you want to look over the code with me, let me know and we can schedule it.
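To make the caching behavior described above concrete, here is a small hedged sketch (the PE numbers and array shape are illustrative, not taken from the actual HAFS configuration):

program peset_cache_sketch
  ! Hedged sketch of the behavior described above. The PE numbers (parent-tile
  ! root 0, nest PEs 448..735) and the array shape are illustrative.
  use mpp_mod, only: mpp_init, mpp_exit, mpp_broadcast, mpp_pe
  implicit none
  integer :: bcast_pelist(289), n
  real, allocatable :: g_dat(:,:,:)

  call mpp_init()

  bcast_pelist(1)  = 0                         ! sending parent-tile root PE
  bcast_pelist(2:) = (/ (n, n = 448, 735) /)   ! nest PEs

  allocate(g_dat(1320,1200,1))
  g_dat = 0.0

  if (any(mpp_pe() == bcast_pelist)) then
    ! First call: get_peset does not recognize this pelist, so it creates a
    ! communicator (via MPI_Comm_create_group) and stores it in the peset table.
    call mpp_broadcast(g_dat, size(g_dat), bcast_pelist(1), bcast_pelist)
    ! Second call with the same pelist: get_peset returns the stored entry and
    ! mpp_broadcast reuses its communicator; no new communicator is created.
    call mpp_broadcast(g_dat, size(g_dat), bcast_pelist(1), bcast_pelist)
  end if

  call mpp_exit()
end program peset_cache_sketch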
I see your point about being able to use get_peset() to create the communicator once, storing it in the peset structure, so that mpp_broadcast() then reuses that communicator. My implementation does essentially the same thing while storing the communicator in the fv_atmos_type structure.
I'm a little worried about the extra work done in the MPP approach: every call to get_peset() will check for monotonicity and check for a match against an existing peset. I'll see if I can code up the MPP approach in my unit tester and determine whether my worry is justified.
My worry about the extra work was unfounded. I will re-code the PR using mpp_broadcast.
You can find my unit testers under clogin02:/u/daniel.kokron/Bcast/
The testers simulate what happens in a HAFS case with one nest run on 736 ranks.
Some timings:
grep Cycles Bcast.o58131230
Using MPI_Bcast with a pre-made communicator (C implementation)
Cycles to setup subcomm: 789298313
Cycles Bcast with a comm: 350650485
Cycles Bcast with a comm: 96031373
Cycles Bcast with a comm: 3138885
Cycles Bcast with a comm: 3160485
Cycles Bcast with a comm: 3150698
Cycles Bcast with a comm: 3127387
Cycles Bcast with a comm: 3128850
Cycles Bcast with a comm: 3114293
Cycles Bcast with a comm: 3123540
Cycles Bcast with a comm: 3147165
Cycles Bcast with a comm: 3136793
Cycles Bcast with a comm: 3124845
Cycles Bcast with a comm: 3137737
Cycles Bcast with a comm: 3132450
Cycles Bcast with a comm: 3156840
Using MPI_Bcast with a pre-made communicator (Fortran implementation)
Cycles to setup subcomm 136254240
Cycles rank/thread 332684775
Cycles rank/thread 4624537
Cycles rank/thread 3276248
Cycles rank/thread 3257415
Cycles rank/thread 3283357
Cycles rank/thread 3271545
Cycles rank/thread 3424860
Cycles rank/thread 3280073
Cycles rank/thread 3279398
Cycles rank/thread 3288915
Cycles rank/thread 3359723
Cycles rank/thread 3296475
Cycles rank/thread 3246457
Cycles rank/thread 3312000
Cycles rank/thread 3278633
Using mpp_broadcast()
Cycles rank/thread 548630797
Cycles rank/thread 3106642
Cycles rank/thread 3273503
Cycles rank/thread 3308108
Cycles rank/thread 3264412
Cycles rank/thread 3284077
Cycles rank/thread 3473212
Cycles rank/thread 3260768
Cycles rank/thread 3294653
Cycles rank/thread 3311325
Cycles rank/thread 3297983
Cycles rank/thread 3296722
Cycles rank/thread 3306937
Cycles rank/thread 3325117
Cycles rank/thread 3270375
@dkokron - I appreciate your willingness to work with us on this.
Never easy.
Swapping in mpp_broadcast in place of MPI_Bcast in fill_nested_grid_cpl() results in runtime failure.
subroutine fill_nested_grid_cpl(this_grid, proc_in)
- use mpi
+ !use mpi
+ use mpp_mod, only: mpp_broadcast
integer, intent(in) :: this_grid
logical, intent(in), optional :: proc_in
@@ -2460,7 +2461,8 @@ contains
g_dat(isg:,jsg:,1), position=CENTER)
endif
if(any(mpp_pe() == Atm(this_grid)%Bcast_ranks)) then
- call MPI_Bcast(g_dat, size(g_dat), MPI_REAL, 0, Atm(this_grid)%Bcast_comm, ierr) ! root==0 because sending rank (Atm(this_grid)%sending) is rank zero in Bcast_comm
+ !call MPI_Bcast(g_dat, size(g_dat), MPI_REAL, 0, Atm(this_grid)%Bcast_comm, ierr) ! root==0 because sending rank (Atm(this_grid)%sending) is rank zero in Bcast_comm
+ call mpp_broadcast(g_dat, size(g_dat),Atm(this_grid)%sending_proc,Atm(this_grid)%Bcast_ranks)
endif
call timing_off('COMM_TOTAL')
if (process) then
The stack trace points to a part of the code I didn't change.
nid001001.acorn.wcoss2.ncep.noaa.gov 0: NOTE from PE 0: eval_move_nest: move_cd_x= 0 move_cd_y= 0 do_move= F
nid001001.acorn.wcoss2.ncep.noaa.gov 0: NOTE from PE 0: Opened BC file: INPUT/gfs_bndy.tile7.003.nc
nid001001.acorn.wcoss2.ncep.noaa.gov 0: forrtl: error (78): process killed (SIGTERM)
nid001001.acorn.wcoss2.ncep.noaa.gov 0: Image PC Routine Line Source
nid001001.acorn.wcoss2.ncep.noaa.gov 0: fv3_32bit.exe 000000000258B397 fv_regional_mod_m 3387 fv_regional_bc.F90
nid001001.acorn.wcoss2.ncep.noaa.gov 0: fv3_32bit.exe 000000000257BCD5 fv_regional_mod_m 1792 fv_regional_bc.F90
nid001001.acorn.wcoss2.ncep.noaa.gov 0: fv3_32bit.exe 000000000259E347 fv_regional_mod_m 1592 fv_regional_bc.F90
nid001001.acorn.wcoss2.ncep.noaa.gov 0: fv3_32bit.exe 000000000291AABA atmosphere_mod_mp 673 atmosphere.F90
Hi. Sorry for the delay. This is a bit odd and might be a sign of a memory
bug somewhere. How soon in the run does this error trigger? Also, does
running in debug mode raise any errors?
Thanks,
Lucas
I was able to get it working with mpp_broadcast() by playing games with mpp_set_current_pelist(). I'm running the regression suite now and will redo some of the performance testing to make sure it still helps.
That is great. Thanks for continuing to work on this.
Lucas
I gathered more timings and updated the table. Using MPI_Bcast directly is significantly faster than using mpp_broadcast(). Please review the code and let me know if I'm using MPP incorrectly.
@dkokron - what timings did you get using mpp_broadcast for comparison with the data in the PR description?
Scenario 3.
The underlying code in the PR now uses mpp_broadcast().
Given the significant difference in timings between scenarios 3 (mpp_broadcast) and 4 (MPI_Bcast), which proposal should we move forward with?
What do you have in mind?
Let me know if you need me to test anything.
…On Wed, May 31, 2023 at 12:32 PM lharris4 wrote:
> Hello @dkokron. @bensonr and I are discussing how to move forward. There may be an underlying problem in the way the SST is being gathered in the first place that isn't addressed by this fix.
@dkokron - I have a test case that creates a 6-tile mosaic with a nest. I'd like to run your exact sizes to get some timings on gaea. What core counts and nest configuration are you using, so I can have a good comparison?
The 26-node case is on acorn
/lfs/h1/hpc/support/daniel.kokron/HAFS/hafsv1_final/T2O_2020092200_17L
T2O_2020092200_17L > ll *.nml
-rw-r--r-- 1 daniel.kokron support 8979 Mar 28 22:24 input720_nest02.nml
-rw-r--r-- 1 daniel.kokron support 9416 Mar 28 22:24 input720.nml
-rw-r--r-- 1 daniel.kokron support 9016 Apr 12 21:58 input736_nest02.nml
-rw-r--r-- 1 daniel.kokron support 9453 Apr 19 16:55 input736.nml
lrwxrwxrwx 1 daniel.kokron support 19 Mar 29 15:41 input_nest02.nml ->
input736_nest02.nml
lrwxrwxrwx 1 daniel.kokron support 12 Mar 29 15:41 input.nml ->
input736.nml
The production case is on cactus.
/u/daniel.kokron/Scratch/HAFS/hafsv1_merge_hfsb/2021082712/09L/forecast
I gave up my access to NCEP/NCO when WCOSS2 came online and the rules and restrictions became too much for me to maintain it. I'd appreciate it if you could give me the domain sizes (npx, npy, npz) and layouts for the global and nest grids, as well as the nest_nml info for the two cases you are interested in - thx
26-node case
Parent grid
layout = 16,28 ! LongEW
io_layout = 1,1
npx = 1321
npy = 1201
ntiles = 1
npz = 81
grid_pes = 448,288
Nest grid
layout = 12,24 ! LongEW
io_layout = 1,1
npx = 601
npy = 601
ntiles = 1
npz = 81
47-node case
------------------
Parent grid
layout = 36,50
io_layout = 1,1
npx = 1201
npy = 1201
ntiles = 1
npz = 81
grid_pes = 1800,840
Nest grid
layout = 20,42
io_layout = 1,1
npx = 601
npy = 601
ntiles = 1
npz = 81
Thanks for passing all of this along. I finally have the prototype test code running where I create a top-level domain with a nest. Using your data for the 47-node case (21 nodes on gaea with no OMP threads), I have generated data for both 32-bit and 64-bit quantities. The sample sizes were 25 and 50, respectively. The first iteration for each was a touch longer as they set up the pelist and communicator.
For 64-bit data, the max average time for:
global gather on 1800 parent ranks is: 0.16s
broadcast from 1 parent rank to 840 nest ranks is: 0.058s
For 32-bit data, the max average time for:
global gather on 1800 parent ranks is: 0.109s
broadcast from 1 parent rank to 840 nest ranks is: 0.029s
In your code you are swapping scope via the mpp_set_current_pelist call. There is no need for the scope changes when using explicit pelists for mpp communication.
do n = 1, 50
   t_dum = mpp_get_tile_id(domain_coarse)
   if (((t_dum(1) == tile_coarse(1)) .and. mpp_domain_is_tile_root_pe(domain_coarse)) .or. (is_fine_pe)) then
      time_it = MPI_Wtime()
      bcast_npes = mpp_get_domain_npes(domain_fine) + 1
      allocate(bcast_pelist(bcast_npes))
      bcast_pelist(1) = mpp_get_domain_tile_root_pe(domain_coarse)
      call mpp_get_pelist(domain_fine, bcast_pelist(2:bcast_npes))
      call mpp_broadcast(xg, size(xg), bcast_pelist(1), bcast_pelist)
      time_it = MPI_Wtime() - time_it
      t_max = time_it
      t_min = time_it
      call mpp_max(t_max, bcast_pelist)
      call mpp_min(t_min, bcast_pelist)
      if (mpp_pe() == bcast_pelist(1)) write(6,'(a,f10.6,2x,f10.6)') ' time for bcast ', t_min, t_max
      deallocate(bcast_pelist)
   endif
enddo
In the code above, all MPI ranks enter the do-loop, but only those participating in the broadcast enter the if-test. The above code is not 100% generic, though you can simplify it immensely for use in fill_nested_grid_cpl (see the sketch after this comment).
All MPICH parameters were the defaults as defined by HPE/Cray; no MPICH tunables were utilized for this test.
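A hedged sketch of how that simplification might look for a single broadcast in fill_nested_grid_cpl (illustrative only, not the final PR code; the subroutine wrapper and module attributions are assumptions, and the arguments carry the same meaning as in the snippet above):

subroutine fill_nest_bcast_sketch(domain_coarse, domain_fine, tile_coarse, is_fine_pe, g_dat)
  ! Illustrative simplification of the timing loop above: one broadcast,
  ! no instrumentation. Wrapper and module attributions are assumptions.
  use mpp_mod,         only: mpp_pe, mpp_broadcast
  use mpp_domains_mod, only: domain2d, mpp_get_tile_id, mpp_domain_is_tile_root_pe, &
                             mpp_get_domain_npes, mpp_get_domain_tile_root_pe, mpp_get_pelist
  implicit none
  type(domain2d), intent(in)    :: domain_coarse, domain_fine
  integer,        intent(in)    :: tile_coarse(:)
  logical,        intent(in)    :: is_fine_pe
  real,           intent(inout) :: g_dat(:,:,:)
  integer :: bcast_npes
  integer, allocatable :: bcast_pelist(:), t_dum(:)

  t_dum = mpp_get_tile_id(domain_coarse)
  if (((t_dum(1) == tile_coarse(1)) .and. mpp_domain_is_tile_root_pe(domain_coarse)) .or. is_fine_pe) then
     bcast_npes = mpp_get_domain_npes(domain_fine) + 1
     allocate(bcast_pelist(bcast_npes))
     bcast_pelist(1) = mpp_get_domain_tile_root_pe(domain_coarse)   ! broadcast root: parent-tile root PE
     call mpp_get_pelist(domain_fine, bcast_pelist(2:bcast_npes))   ! nest PEs
     call mpp_broadcast(g_dat, size(g_dat), bcast_pelist(1), bcast_pelist)
     deallocate(bcast_pelist)
  endif
end subroutine fill_nest_bcast_sketch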
Rusty,
I was getting runtime errors without the scoping. I think I see my error.
I am starting my bcast_npes indexing at zero not 1. More testing.
Dan
Rusty,
Please send me your complete reproducer.
Sure, I'll share it via your email.
***@***.*** or
***@***.***
Unfortunately, GitHub blocked the addresses you just included. I was planning to send to your NOAA account, if that's okay with you.
oops ***@***.*** (not gov)
NOAA account works too.
This is what I get when using mpp_broadcast() in HAFS without the scoping.
I really cannot imagine how that -448 is getting in there.
nid001022.acorn.wcoss2.ncep.noaa.gov 724: fill_nested_grid_cpl:
724
nid001022.acorn.wcoss2.ncep.noaa.gov 724: MPICH ERROR [Rank 724] [job id
65bd0239-958a-4cad-bd26-b2808f73b8ba] [Fri Jun 2 19:53:29 2023]
[nid001022] - Abort(940145670) (rank 724 in comm 0): Fatal error in
PMPI_Group_incl: Invalid rank, error stack:
nid001022.acorn.wcoss2.ncep.noaa.gov 724:
PMPI_Group_incl(170).............: MPI_Group_incl(group=0xc8000006, n=289,
ranks=0x7fffc9caf880, new_group=0x14713ad803f8) failed
nid001022.acorn.wcoss2.ncep.noaa.gov 724:
MPIR_Group_check_valid_ranks(244): Invalid rank in rank array at index 0;
value is -448 but must be in the range 0 to 287
nid001022.acorn.wcoss2.ncep.noaa.gov 724:
nid001022.acorn.wcoss2.ncep.noaa.gov 724: aborting job:
if (any(mpp_pe() == Atm(this_grid)%Bcast_ranks)) then
   print*, 'fill_nested_grid_cpl: ', mpp_pe()
   call mpp_broadcast(g_dat, size(g_dat), Atm(this_grid)%Bcast_ranks(1), Atm(this_grid)%Bcast_ranks)
endif
Submitted NOAA-GFDL/FMS#1246 to address the failure seen above when using mpp_broadcast(). Please consider using the pure MPI approach to address this performance issue.
@junwang-noaa We are not using FMS for I/O anymore; we switched to ESMF multi-tile support for I/O. This will be available once we merge ufs-community/ufs-weather-model#1845.
@junwang-noaa - I know this is a little off topic for the dycore repo here... While you are working through that list, you still have the ability to compile in the deprecated I/O modules via the -Duse_deprecated_io flag (see the FMS release notes). We test MOM6 as part of our FMS release testing, and there are some updates from the main trunk that will need to be pulled into the ufs branch to resolve any issues.
Rusty, thanks for the information!
As far as GFDL_atmos_cubed_sphere is concerned, is this PR fit to move forward?
As long as FMS is compiled with the flag -Duse_deprecated_io, this PR is fit to move forward. @jkbk2004, as UFS code manager, when do you think we can have this PR merged?
We are working on the spack-stack PR. For FMS-2023.02, the JEDI team needs to use -Duse_deprecated_io. A bit of coordination and testing is needed to confirm the test results. I think we can schedule the commit for next week or so, but I will keep checking as testing goes on.
In that case, should I extract the FMS-2023.02 porting modifications from my branch?
I see no harm in keeping these changes in.
model/fv_arrays.F90
Outdated
@@ -1843,6 +1848,7 @@ end subroutine allocate_fv_atmos_type

 !>@brief The subroutine 'deallocate_fv_atmos_type' deallocates the fv_atmos_type.
 subroutine deallocate_fv_atmos_type(Atm)
+  use mpi
Is this still needed?
Not needed anymore since we are using mpp_broadcast() now instead of MPI_Bcast().
Would like to see this use statement removed prior to accepting.
Removed unneeded use of MPI module in model/fv_arrays.F90
Rusty,
I'm on PTO until 3 Oct. I won't have GitHub access until then.
@bensonr are there any more changes that you would like to see in this PR?
The FMS modification that supports this PR was merged. See details at NOAA-GFDL/FMS#1246.
@dkokron - at the bi-weekly UFS code managers meeting we have been telling them this is ready to merge - they have not yet blessed it.
The UFS code managers will be ready to merge this change in the coming weeks. @dkokron, could you create an fv3atm PR and a UFS-weather-model PR for this change?
@laurenchilutti Will do.
Opened PR against fv3atm
Opened PR with UFS
…stedGridCpl Using MPI broadcast instead of multiple point-to-point messages to improve performance
Testing on #2059 has completed successfully; please proceed with the merge process for this PR.
Copying in @laurenchilutti as well for this request.
@bensonr can you merge this PR?
Performance profiling of a HAFS case on NOAA systems revealed that a significant amount of time was spent in fill_nested_grid_cpl(). The fill_nested_grid_cpl() routine from FV3/atmos_cubed_sphere/driver/fvGFS/atmosphere.F90 shows up as a performance bottleneck. This routine gathers a global SST field (6,336,000 bytes) onto rank 0, then broadcasts that field to all ranks in the nest. The code uses point-to-point (p2p) messages (Isend/Recv) from rank 0 to the nest ranks. This communication pattern maxes out the Slingshot-10 link on the first node, resulting in a 0.15 s hit every fifth time step, which works out to 189 seconds for a full 5-day simulation. These timings are from the 26-node HAFS case on acorn.
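For illustration, a minimal sketch of the two communication patterns being compared; the names (nest_ranks, bcast_comm, use_bcast, g_dat, i_am_sender) are placeholders, not the actual FV3 symbols, and the matching Recv posted by each nest rank in the p2p pattern is not shown:

subroutine update_nest_sketch(g_dat, nest_ranks, bcast_comm, i_am_sender, use_bcast)
  ! Illustrative contrast of the two patterns discussed in this PR.
  use mpi
  implicit none
  real,    intent(inout) :: g_dat(:)      ! e.g. the gathered global SST field
  integer, intent(in)    :: nest_ranks(:) ! world ranks of the nest PEs
  integer, intent(in)    :: bcast_comm    ! pre-built comm; the sending rank is rank 0 in it
  logical, intent(in)    :: i_am_sender, use_bcast
  integer :: n, ierr
  integer :: req(size(nest_ranks))

  if (.not. use_bcast) then
    ! Original pattern: rank 0 posts one Isend per nest rank.
    if (i_am_sender) then
      do n = 1, size(nest_ranks)
        call MPI_Isend(g_dat, size(g_dat), MPI_REAL, nest_ranks(n), 0, &
                       MPI_COMM_WORLD, req(n), ierr)
      end do
      call MPI_Waitall(size(nest_ranks), req, MPI_STATUSES_IGNORE, ierr)
    end if
  else
    ! Pattern proposed here: a single collective over the sender + nest ranks.
    call MPI_Bcast(g_dat, size(g_dat), MPI_REAL, 0, bcast_comm, ierr)
  end if
end subroutine update_nest_sketch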
Given the way it is currently coded, I expect the bottleneck to grow with the number of ranks in the nest. Timings were gathered from a 47-node production case run for a 24-hour simulation on cactus, with and without the code changes contained in this PR.
An offline reproducer showed that setting the following environment variables improved the performance of using MPI_Bcast in this scenario.
export MPICH_BCAST_ONLY_TREE=0
export MPICH_COLL_OPT_OFF=mpi_bcast
Scenarios:
Performance metric:
Add up the phase1 and phase2 timings printed in the output listing
grep PASS stdout | awk '{t+=$10;print t}' | tail -1
The units are seconds.
Based on these timings, I expect savings from scenario 3 (this PR) of ~(26s)*5=131 seconds for a full 5-day simulation.
Based on these timings, I expect savings from scenario 4 (previous version of this PR) of ~(66s)*5=332 seconds for a full 5-day simulation.
This PR does not address any open issues.
I ran the UFS regression suite on acorn and cactus. Both runs resulted in "REGRESSION TEST WAS SUCCESSFUL"
Checklist:
Please check all items, whether they apply or not