Parallel cyclic reduction #2330
…pcr_cyclic_copy+asan
Same tests still fail...
TODO This is the commit you probably want to revert to get back to next's cyclic
Also tidy that function's docs All tests pass
/clang-format
Should we name this one "cyclic" and rename the direct partitioned one to something else? We might also want to put the citation for this solver into
I think on balance renaming the old one (to
There are some limitations:
Other possible enhancements
clang-tidy made some suggestions
There were too many comments to post at once. Showing the first 25 out of 89. Check the log or trigger a new build to see more.
clang-tidy made some suggestions
There were too many comments to post at once. Showing the first 25 out of 66. Check the log or trigger a new build to see more.
outbndry = 1;

if (dst) {
  BOUT_OMP(parallel) {
warning: if with identical then and else branches [bugprone-branch-clone]
if (dst) {
^
src/invert/laplace/impls/pcr/pcr.cxx:225:5: note: else branch starts here
} else {
^
This is wrong; I think it's just confused by the OpenMP loop turning into a function call.
clang-tidy made some suggestions
There were too many comments to post at once. Showing the first 25 out of 41. Check the log or trigger a new build to see more.
clang-tidy made some suggestions
Identifiers beginning with double-underscore are technically reserved
- multiple declarations on single line
- implicit conversion to bool
- braces around if/for statements
xproc = localmesh->getXProcIndex(); // Local rank in x proc space
const int yproc = localmesh->getYProcIndex();
nprocs = localmesh->getNXPE(); // Number of processors in x
myrank = yproc * nprocs + xproc; // Global rank for communication
Could just be BoutComm::rank()?
clang-tidy made some suggestions
What do we want to do about the names? I'm happy for this to go in now, and we can sort out the names in another PR
PCR is a parameter-free direct solver, so it is a like-for-like replacement for
The way around this is to make the solver node-aware: solving within a node using (something like) the Thomas algorithm or the old cyclic (which doesn't care about point or core counts), and between nodes with PCR. This would only require using 2^n nodes, which is much less restrictive. The "within a node" bit will require work beyond this PR, though. The Kang library is set up to do Thomas + PCR, but it only has MPI, so I doubt it would be node-aware out of the box. Doing old cyclic on-node and PCR off-node might be nice, though (and wouldn't require OpenMP (or SYCL 😛)).
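As an illustration of the 2^n constraint mentioned in the comment above, a count could be checked up front with a small helper. This is only a sketch: `isPowerOfTwo` is a hypothetical name, not something that exists in BOUT++ or in this PR, and whether the check should apply to nodes or to processes depends on the design being discussed.

```cpp
// Hypothetical helper (not part of BOUT++): true if n is a positive
// power of two (1, 2, 4, 8, ...), the kind of count a node-aware PCR
// would need between nodes according to the comment above.
constexpr bool isPowerOfTwo(int n) {
  // A power of two has exactly one bit set, so clearing the lowest
  // set bit with n & (n - 1) must leave zero.
  return n > 0 && (n & (n - 1)) == 0;
}
```

For example, `isPowerOfTwo(128)` is true while `isPowerOfTwo(96)` is false.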
After a bit of code-digging:
BoutReal zlen = coords->dz * (localmesh->LocalNz - 3);
BoutReal kwave =
    kz * 2.0 * PI / (2. * zlen); // wave number is 1/[rad]; DST has extra 2.
It sounds like "DST has extra 2" means that there are two extra cells, but then why is nz - 3 used?
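For reference, the arithmetic in the quoted snippet can be written out as below. This only restates what the two code lines compute, with $N_z$ standing for `LocalNz`. Reading the "extra 2" as the factor of 2 in the denominator, rather than as two extra cells, is an assumption, and it does not explain the choice of $N_z - 3$.

```latex
z_{\mathrm{len}} = \mathrm{d}z \,(N_z - 3), \qquad
k_{\mathrm{wave}} = \frac{2\pi\, k_z}{2\, z_{\mathrm{len}}} = \frac{\pi\, k_z}{z_{\mathrm{len}}}
```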
As mentioned in #2296, cyclic reduction wasn't actually performing cyclic reduction. Also, as noted in #2314, its scaling with processor count on Archer2 is much worse than we've seen on other machines.
This PR adds a parallel cyclic reduction (PCR) solver, incorporating code from this repo, published under the MIT licence. It can be selected using type=pcr.

This solver scales much better than "old cyclic" while still being a parameter-free direct solver. The graph below shows run times on Archer2 (with 128 cores per node) for blob2d with (nx,nz)=(8192,1024) and rk4 time advance. Old cyclic (in magenta) doesn't scale nearly as well as parallel cyclic reduction (in red). The multigrid-Thomas solver (blue) is available in BOUT++ as type=ipt (impls/iterative_parallel_tri), while multigrid (green) is not the multigrid in next; it hasn't been pushed yet. However, PCR is a good drop-in replacement for old cyclic, without the need to fiddle with parameters (and the possible non-convergence) that comes with multigrid.
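For readers unfamiliar with the method, here is a minimal serial sketch of parallel cyclic reduction applied to a single tridiagonal system. It is not the code added in this PR. The function name `pcrSolve`, the plain `std::vector<double>` storage and the non-periodic boundaries are all assumptions made for illustration, and the real solver distributes rows across MPI ranks. At each step every row eliminates its coupling to the rows a distance `s` away, and the stride doubles until all rows are decoupled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial sketch of parallel cyclic reduction (PCR) for the system
//   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = r[i],  i = 0..n-1.
// Illustrative only: names and layout are assumptions, not BOUT++ API.
std::vector<double> pcrSolve(std::vector<double> a, std::vector<double> b,
                             std::vector<double> c, std::vector<double> r) {
  const std::size_t n = b.size();
  assert(a.size() == n && c.size() == n && r.size() == n);
  if (n == 0) {
    return {};
  }
  // Non-periodic system: couplings that point outside the domain are zero.
  a[0] = 0.0;
  c[n - 1] = 0.0;

  // After a step with stride s, row i couples only to rows i - 2s and i + 2s.
  for (std::size_t s = 1; s < n; s *= 2) {
    std::vector<double> na(n), nb(n), nc(n), nr(n);
    // Every row update reads only the old coefficients, so the iterations
    // are independent; this is the part a parallel version spreads out.
    for (std::size_t i = 0; i < n; ++i) {
      const bool lower = i >= s;    // row i - s exists
      const bool upper = i + s < n; // row i + s exists
      const double alpha = lower ? -a[i] / b[i - s] : 0.0;
      const double gamma = upper ? -c[i] / b[i + s] : 0.0;
      na[i] = lower ? alpha * a[i - s] : 0.0;
      nc[i] = upper ? gamma * c[i + s] : 0.0;
      nb[i] = b[i] + (lower ? alpha * c[i - s] : 0.0)
                   + (upper ? gamma * a[i + s] : 0.0);
      nr[i] = r[i] + (lower ? alpha * r[i - s] : 0.0)
                   + (upper ? gamma * r[i + s] : 0.0);
    }
    a.swap(na);
    b.swap(nb);
    c.swap(nc);
    r.swap(nr);
  }

  // All off-diagonal couplings are now zero: each row is b[i]*x[i] = r[i].
  std::vector<double> x(n);
  for (std::size_t i = 0; i < n; ++i) {
    x[i] = r[i] / b[i];
  }
  return x;
}
```

Calling `pcrSolve({0, -1, -1, -1}, {2, 2, 2, 2}, {-1, -1, -1, 0}, {0, 0, 0, 5})` solves a four-point system whose exact solution is (1, 2, 3, 4). The sketch does no pivoting, so it assumes the diagonal entries stay non-zero throughout, as they do for diagonally dominant systems.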