Add PCR Thomas Laplacian inversion solver #2340
Conversation
Currently just a copy of pcr
clang-tidy made some suggestions
```cpp
}

if (dst) {
  BOUT_OMP(parallel) {
```
```
warning: if with identical then and else branches [bugprone-branch-clone]
    if (dst) {
    ^
src/invert/laplace/impls/pcr_thomas/pcr_thomas.cxx:230:5: note: else branch starts here
    } else {
    ^
```
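For context, a minimal self-contained sketch of the pattern this check flags (illustrative only, not the actual pcr_thomas code):

```cpp
#include <vector>

// bugprone-branch-clone: both branches do identical work, so the `if`
// is redundant (or one branch was meant to differ and this is a bug).
void scale(std::vector<double>& v, bool dst) {
  if (dst) {
    for (auto& x : v) {
      x *= 2.0;
    }
  } else {
    for (auto& x : v) {
      x *= 2.0; // identical to the then-branch: clang-tidy warns here
    }
  }
}

// Fix: collapse to a single unconditional block (or make the branches
// genuinely different if they were supposed to differ).
void scale_fixed(std::vector<double>& v) {
  for (auto& x : v) {
    x *= 2.0;
  }
}
```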
```cpp
BoutReal zlen = coords->dz * (localmesh->LocalNz - 3);
BoutReal kwave =
    kz * 2.0 * PI / (2. * zlen); // wave number is 1/[rad]; DST has extra 2.
```
Similar to what I asked for the pcr solver: why is there a 3 here when the comment says 2?

Besides that, would this also work if `dz` is a `Field2D`? I assume variation of `dz` along `z` wouldn't work, since FFTs require a uniform grid ...
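For reference, a sketch of the two wavenumber conventions at play here (my reading, not code from the PR; the `-3` offset is kept verbatim since it is exactly the open question):

```cpp
constexpr double PI = 3.141592653589793;

// FFT (periodic domain): modes are exp(i * n * 2*pi * z / L),
// so the wavenumber is n * 2*pi / L.
double kwave_fft(int kz, double dz, int nz) {
  const double zlen = dz * nz; // full periodic length
  return kz * 2.0 * PI / zlen;
}

// DST (sine series): modes are sin(n * pi * z / L), i.e. half-wavelengths
// fit in the domain -- hence the "extra 2." in the original denominator.
double kwave_dst(int kz, double dz, int nz) {
  const double zlen = dz * (nz - 3); // why -3 rather than -2 is the question
  return kz * 2.0 * PI / (2.0 * zlen); // equals kz * PI / zlen
}
```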
This would work best when transformed into a node-aware solver. The best way to do that might be running one MPI rank per node, and then filling up the node with OpenMP threads. So, we might want to wait for the outer-loop stuff in order to get the best performance.
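A minimal sketch of that rank-per-node layout using standard MPI-3 calls (an assumption about how the node-aware version might be organised, not code from this PR):

```cpp
#include <cstdio>
#include <mpi.h>
#include <omp.h>

// One communicator per shared-memory node: a node-aware solver could run
// one rank per node and fill the rest of the node with OpenMP threads.
int main(int argc, char** argv) {
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  // Split MPI_COMM_WORLD into one communicator per node.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                      MPI_INFO_NULL, &node_comm);

  int node_rank;
  MPI_Comm_rank(node_comm, &node_rank);

  // Inter-node communication (the O(log2(nnodes)) part) would go over a
  // communicator containing only the node leaders.
  MPI_Comm leaders;
  MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                 world_rank, &leaders);

  std::printf("world rank %d: node rank %d, %d OpenMP threads available\n",
              world_rank, node_rank, omp_get_max_threads());

  if (leaders != MPI_COMM_NULL) {
    MPI_Comm_free(&leaders);
  }
  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}
```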
clang-tidy made some suggestions:

- implicit narrowing conversion
- C-style array
- loop variable name too short
- multiple declarations on same line
- dead stores
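A hedged before/after sketch of the kinds of changes those warnings ask for (illustrative code only, not the PR's actual diffs):

```cpp
#include <array>

// Before (illustrative): exhibits each warning from the list above.
double sum_before(double scale) {
  int n = scale * 8;             // implicit narrowing conversion (double -> int)
  double buf[8];                 // C-style array
  double a = 0.0, total = 0.0;   // multiple declarations on the same line
  a = scale;                     // dead store: the 0.0 above is never read
  for (int i = 0; i < n && i < 8; ++i) { // loop variable name too short
    buf[i] = a * i;
    total += buf[i];
  }
  return total;
}

// After: the same logic with each warning addressed.
double sum_after(double scale) {
  const auto n = static_cast<int>(scale * 8); // explicit conversion
  std::array<double, 8> buf{};                // std::array instead of C array
  const double factor = scale;                // one declaration, no dead store
  double total = 0.0;
  for (int index = 0; index < n && index < 8; ++index) { // descriptive name
    buf[index] = factor * index;
    total += buf[index];
  }
  return total;
}
```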
This adds another direct parallel solver, `pcr_thomas`, based on Laszlo, Gilles and Appleyard, "Manycore Algorithms for Batch Scalar and Block Tridiagonal Solvers", ACM TOMS, 42, 31 (2016), and adapted from the parallel_tdma_cpp library. It is very similar to the parallel cyclic reduction solver (#2330), except that instead of performing PCR directly on the distributed tridiagonal matrix, it uses the Thomas algorithm on each core to get a reduced tridiagonal system with 2 equations per processor. It then calls PCR on this reduced system. This reduces the number of communications from `O(log2(nx))` to `O(log2(nxpe))`, without changing the `O(nx)` complexity.

This solver passes the Laplace integrated tests, but I haven't done any performance analysis on it. It is really intended as a stepping stone to a node-aware solver, which would further reduce the communication count to `O(log2(nnodes))` inter-node communications. To do that, we need to replace the Thomas solver with a parallel direct solver and work out how to extract the coefficients to feed into a node-level PCR. This PR gives the code structure to do this, but since it also adds a complete solver, I thought I'd add it as an option.