
Constructing a high resolution space on GPU fails #2096

Open
juliasloan25 opened this issue Dec 3, 2024 · 5 comments · May be fixed by #2100

@juliasloan25 (Member)

Describe the bug

When I try to set up a space with 200 or more horizontal elements (nelements[1] >= 200), I get the following error: ERROR: LoadError: Number of blocks in y-dimension exceeds device limit (240000 > 65535).

I found this online: "Blocks can be organized into one-, two- or three-dimensional grids of up to 2^31-1, 65,535 and 65,535 blocks in the x, y and z dimensions respectively." It looks like we're hitting the limit in the y dimension when we construct our space, even though there is plenty of headroom in the x dimension. Maybe the block usage can be changed in ClimaCore?
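
For reference, the 240,000 in the error message is consistent with the full cubed-sphere element count being mapped onto the y dimension of the CUDA grid. A quick back-of-the-envelope check (illustrative arithmetic only, not ClimaCore internals):

# 6 cubed-sphere panels, each with nelements[1] x nelements[1] elements
nelements_h = 200
Nh = 6 * nelements_h^2        # = 240_000, the number reported in the error
max_grid_dim_y = 65_535       # CUDA limit on the y dimension of the grid
Nh > max_grid_dim_y           # true, hence the failed kernel launch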

To Reproduce

[on clima node]

srun --gpus=1 --mpi=none -t 02:00:00 --pty bash -l 
export CLIMACOMMS_DEVICE="CUDA"
module load common
julia --project=.buildkite
import ClimaComms
ClimaComms.@import_required_backends
import ClimaCore:
    Domains,
    Fields,
    Geometry,
    Meshes,
    Spaces,
    Topologies

FT = Float64
radius = FT(6378.1e3)
depth = FT(50)
nelements = (200, 15)
dz_tuple = FT.((10.0, 0.05))
npolynomial = 1

device = ClimaComms.device()
comms_ctx = ClimaComms.context()

vertdomain = Domains.IntervalDomain(
    Geometry.ZPoint(FT(-depth)),
    Geometry.ZPoint(FT(0));
    boundary_names = (:bottom, :top),
)
vertmesh = Meshes.IntervalMesh(
    vertdomain,
    Meshes.GeneralizedExponentialStretching{FT}(
        dz_tuple[1],
        dz_tuple[2],
    );
    nelems = nelements[2],
    reverse_mode = true,
)
vert_center_space = Spaces.CenterFiniteDifferenceSpace(device, vertmesh)

horzdomain = Domains.SphereDomain(radius)
horzmesh = Meshes.EquiangularCubedSphere(horzdomain, nelements[1])
horztopology = Topologies.Topology2D(comms_ctx, horzmesh)
quad = Spaces.Quadratures.GLL{npolynomial + 1}()
horzspace = Spaces.SpectralElementSpace2D(horztopology, quad)

# Fails with `ERROR: Number of blocks in y-dimension exceeds device limit (240000 > 65535).`
subsurface_space = Spaces.ExtrudedFiniteDifferenceSpace(
    horzspace,
    vert_center_space,
)

Setup information

Using ClimaCore v0.14.20

[jsloan@clima ClimaCore.jl]$ module list
Currently Loaded Modulefiles:
 1) openmpi/4.1.5-mpitrampoline   2) julia/1.10.0   3) cuda/julia-pref   4) common
juliasloan25 added the bug label on Dec 3, 2024
@Sbozzolo (Member) commented Dec 5, 2024

Shorter reproducer:

import ClimaCore
center_space = ClimaCore.CommonSpaces.ExtrudedCubedSphereSpace(
    Float32;
    radius = 1.0,
    h_elem = 105,
    z_elem = 10,
    z_min = 1.0,
    z_max = 2.0,
    n_quad_points = 4,
    staggering = ClimaCore.Grids.CellCenter(),
)

Anything above h_elem = 104 fails.
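
The h_elem = 104 threshold matches the same back-of-the-envelope count of spectral elements landing in the y dimension of the grid (illustrative arithmetic only):

6 * 104^2   # = 64_896 <= 65_535, still launches
6 * 105^2   # = 66_150 >  65_535, exceeds the y-dimension block limit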

@Sbozzolo (Member)

While this is being worked on, we can work around it by setting auto = true and passing nitems to auto_launch! in the Base.copyto! method in data_layouts_copyto!. I haven't tried running a full simulation with this yet.

args = (dest, bc, us)
threads = threads_via_occupancy(knl_copyto!, args)
n_max_threads = min(threads, get_N(us))
p = partition(dest, n_max_threads)
nitems = get_N(us)
auto_launch!(
    knl_copyto!,
    args,
    nitems;
    auto = true,
    threads_s = p.threads,
    blocks_s = p.blocks,
)

@sriharshakandala (Member)

We have kernel launch patterns that use the grid configuration (Nv, Nh).

h_elem = 105 corresponds to Nh = 6 * 105^2 = 66,150 spectral elements, which exceeds the 65,535 limit on the second (y) dimension of the CUDA grid. In the vertical direction, however, we rarely use more than 256 levels, which translates to Nv = 16 or lower in most cases (Nv is approximately n_vertical_levels / 16).
We have the following options:

  1. Flip to (Nh, Nv), since Nv is very small in almost all of our use cases. With Nq = 4 we would only hit this limit at 1,048,560 vertical levels, which we do not anticipate ever using.
  2. Move to one-dimensional indexing (N,) and extract the h and v block indices from the one-dimensional block id; see the sketch after this list.

The first option is the easiest to implement, unless we have a good reason to prefer the second.
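
As a minimal sketch of option 2 (not ClimaCore's API; linear_block_to_vh and the 1-based indexing convention are assumptions), the kernel would be launched with Nv * Nh blocks in the x dimension only, whose limit is 2^31 - 1, and each block would recover its (v, h) pair from its linear index:

# Hypothetical helper: recover (v_block, h_block) from a single 1-based
# linear block index (as returned by CUDA.jl's blockIdx().x).
function linear_block_to_vh(bidx, Nv)
    h = cld(bidx, Nv)          # horizontal block index
    v = bidx - (h - 1) * Nv    # vertical block index
    return (v, h)
end

# Example: Nv = 2, Nh = 66_150 => launch 132_300 blocks along x only.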

@Sbozzolo (Member) commented Dec 12, 2024

From a user's point of view, I think we should try to avoid any such limit in a foundational package like ClimaCore. We don't know what configurations users are going to set up, so there shouldn't be any artificial restriction on the maximum number of levels/elements in either the vertical or the horizontal direction. If we just swap Nh with Nv, this problem will come back when someone tries to run a high-vertical-resolution box or column.

@sriharshakandala (Member) commented Dec 12, 2024

Sure. With option 1 we can still loop over vertical-level blocks once we hit the limit; the point is primarily to significantly increase the limit for Nh.
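
For illustration, a vertical grid-stride loop in the spirit of this suggestion could look like the CUDA.jl sketch below. This is not ClimaCore's actual kernel; knl_copyto_sketch!, the array layout, and the launch shape are assumptions made for the example.

using CUDA

# Horizontal blocks go in the x dimension (limit 2^31 - 1); the y dimension
# never needs to exceed 65,535 because each block strides over any remaining
# vertical chunks.
function knl_copyto_sketch!(out, Nv)
    h = blockIdx().x           # horizontal element index
    tid = threadIdx().x        # node index within the element
    v = blockIdx().y
    while v <= Nv
        out[tid, v, h] = 0.0f0 # placeholder for the real copy
        v += gridDim().y       # stride over vertical chunks
    end
    return nothing
end

Nq, Nv, Nh = 4, 10, 66_150
out = CUDA.zeros(Float32, Nq * Nq, Nv, Nh)
@cuda threads=(Nq * Nq) blocks=(Nh, min(Nv, 65_535)) knl_copyto_sketch!(out, Nv)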

sriharshakandala linked a pull request on Dec 13, 2024 that will close this issue.