blk_1m: A not-finite number detected in: RHS of rc after rc_src #103
Comments
I tested a little, and it seems that this issue and #102 might be symptoms of the same problem. Running the above commands sometimes leads to the error and sometimes leads to the simulation timing out without any error (not even the "stuck in the pressure solver" error). From the plots of the results it looks like the ... In general, I don't mind if the simulation runs a little bit longer. But getting randomly stuck for some parameter combinations is a problem.
As promised, I tested the convergence of the 2D DYCOMS simulations with the blk_1m scheme. This is what I get: The top left plot shows the wall time of the whole simulation as a function of the pressure solver tolerance. The colors from that plot are used to mark the rest of the profiles. The profiles are averaged over the last hour (4 model outputs). I didn't do any ensemble averaging, so it should all be taken with a grain of salt. But it looks like a tolerance of 1e-6 is borderline, and we should not use looser tolerances (i.e. larger prs_tol values). The timing-out issue I mentioned in #102 looks more and more like a cluster hardware issue...
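A tolerance sweep like the one described above could be set up roughly as follows. This is only a sketch: the flags are copied from the job command quoted in this issue, while the list of tolerance values, the output directory naming, and the helper `build_command` are assumptions made for illustration. The script only prints the command lines and does not launch anything.

```python
def build_command(prs_tol: str, outdir: str) -> list[str]:
    """Build one command line; only --prs_tol and --outdir vary between runs."""
    return [
        "bicycles",
        f"--outdir={outdir}",
        "--case=dycoms_rf02",
        "--nx=129", "--ny=0", "--nz=301",
        "--dt=1", "--spinup=3600", "--nt=25200",
        "--micro=blk_1m", "--outfreq=900",
        "--backend=serial",
        "--r_c0=0.000711222222222",
        "--rng_seed=42",
        f"--prs_tol={prs_tol}",
    ]


# Assumed tolerance values, kept as strings so they appear verbatim in the flag.
TOLERANCES = ["1e-4", "5e-5", "1e-5", "1e-6", "1e-7"]

commands = [build_command(tol, f"outdir_prs_tol_{tol}") for tol in TOLERANCES]

for cmd in commands:
    print(" ".join(cmd))
```

Keeping the seed and all other parameters fixed means that differences between runs can be attributed to the tolerance alone, although a single realization per tolerance is still not an ensemble.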
Still debugging. But I talked with other cluster users, and it seems that their random node failure rate is much lower than mine. So it seems we do have some random bug somewhere that is especially prominent in my 2D blk_1m simulations.
The convergence tests are nice, thanks for doing them. Could you test if you still get the error:
Thanks for the hints! I'll be debugging this week(end). I was also thinking that maybe something is wrong with the 2D setup. I'll try to run a small 3D simulation and see if I get similar errors. Same for issue #105 - I got used to some negative values in rr or even rc, but negative values in rv and th are not acceptable :)
For a job:
OMP_NUM_THREADS=32 bicycles --outdir=outdir --case=dycoms_rf02 --nx=129 --ny=0 --nz=301 --dt=1 --spinup=3600 --nt=25200 --micro=blk_1m --outfreq=900 --backend=serial --r_c0=0.000711222222222 --rng_seed=42 --prs_tol=5e-5
I get an error: A not-finite number detected in: RHS of rc after rc_src
(-2,2) x (-2,298)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.24152e-12 6.5588e-11 4.10868e-10 6.10279e-10 5.16959e-10 3.64688e-10 2.21308e-10 3.41213e-11 -4.31171e-10 -7.08708e-10 -6.58321e-10 -5.48502e-10 -3.66218e-10 -1.11794e
...
]
I get a similar error for r_c0 = 0.000622444444444, but this time it's due to a single NaN value:
A not-finite number detected in: RHS of rc after rc_src
(-2,1) x (-2,298)
[ 0 0 0 0 0
...
0 0 0 0 0 0 -nan 0 0 0 0 0 0
...
]
Any clues?
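To narrow down where the bad values sit, the printed array can be scanned for non-finite entries. The snippet below is a minimal sketch, assuming the dump has been pasted into a string; `find_non_finite` is a hypothetical helper and not part of the model code. Because the dumps above are elided with `...`, the reported positions are relative to the pasted text only, not to the full array.

```python
import math


def find_non_finite(dump: str) -> list[tuple[int, str]]:
    """Return (position, token) for every nan/inf token in a printed array."""
    # Drop the brackets and the '...' elision markers, then split on whitespace.
    tokens = [
        tok
        for tok in dump.replace("[", " ").replace("]", " ").split()
        if tok != "..."
    ]
    bad = []
    for i, tok in enumerate(tokens):
        try:
            value = float(tok)  # float() also parses 'nan', '-nan' and 'inf'
        except ValueError:
            # Not a number at all (e.g. a truncated token): report it as well.
            bad.append((i, tok))
            continue
        if not math.isfinite(value):
            bad.append((i, tok))
    return bad


if __name__ == "__main__":
    sample = "[ 0 0 0 -nan 0 0 ]"
    print(find_non_finite(sample))
```

Applied to a complete dump, the positions can be mapped back to grid indices using the printed extents, which shows whether the NaN sits in the halo or in the interior of the domain.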