Tracking perf optimization of `HopperMatmulTest.HSH_NT_128BSwizzle` for problem size `(M=2048, N=2048, K=8192)`, CTA tile size `(128, 256)` #3279

zasdfgbnm · 2024-10-25T18:37:25Z

The CTA tile size (128, 256) is a size that can relatively easily achieve high math throughput. The problem size is carefully selected as one full wave. I believe this is a good incremental task.

Benchmark command:

nsys nvprof ./bin/test_matmul --gtest_filter=HopperMatmulTest.HSH_NT_128BSwizzle

Current perf on H100 on main as in the latest comment:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name

 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     33.8           136319          1  136319.0  136319.0    136319    136319          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     22.7            91487          1   91487.0   91487.0     91487     91487          0.0  nvjet_hsh_128x256_64x4_2x1_v_bz_coopA_NTN

nvFuser/cuBLAS: 67%

The text was updated successfully, but these errors were encountered:

zasdfgbnm · 2024-10-25T20:59:43Z

On H200:

Initial perf as measured in #3281:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name

 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     43.2           205150          1  205150.0  205150.0    205150    205150          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     18.5            87550          1   87550.0   87550.0     87550     87550          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT

nvFuser/cuBLAS = 42.7%

This shape makes more sense: #3137 (comment), #3279 Perf: ``` Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name -------- --------------- --------- -------- -------- -------- -------- ----------- ---------------------------------------------------------------------------------------------------- 43.2 205150 1 205150.0 205150.0 205150 205150 0.0 <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>… 18.5 87550 1 87550.0 87550.0 87550 87550 0.0 nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT ``` nvFuser/cuBLAS = `42.7%`

zasdfgbnm · 2024-10-28T22:46:33Z

On H200:

There is a perf regression after the fix of elect-sync:
#3295

Perf:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)
 Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     47.8           247326          1  247326.0  247326.0    247326    247326          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     17.0            88191          1   88191.0   88191.0     88191     88191          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT

Perf nvFuser/cuBLAS: 35.6%

zasdfgbnm · 2024-10-29T17:15:03Z

On H200:

After #3294:

Perf:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)
 Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     39.0           172735          1  172735.0  172735.0    172735    172735          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     20.0            88768          1   88768.0   88768.0     88768     88768          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT

Perf nvFuser/cuBLAS: 51.4%.

zasdfgbnm · 2024-10-31T18:07:06Z

On H200:

After #3314

Perf:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name

 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     36.0           151775          1  151775.0  151775.0    151775    151775          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     20.7            87135          1   87135.0   87135.0     87135     87135          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT

nvFuser/cuBLAS = 57.4%.

When used with #3545, this contribute a speedup of 5% of cuBLAS! Perf together with #3545 on H100: ``` Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name -------- --------------- --------- -------- -------- -------- -------- ----------- ---------------------------------------------------------------------------------------------------- 33.8 136319 1 136319.0 136319.0 136319 136319 0.0 <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>… 22.7 91487 1 91487.0 91487.0 91487 91487 0.0 nvjet_hsh_128x256_64x4_2x1_v_bz_coopA_NTN ``` nvFuser/cuBLAS: 67% Note that the above test is run with smem epilogue disabled. I will run a test with everything combined later. Also note that this number is on H100, which is different from the H200 in #3279.

zasdfgbnm · 2024-12-10T17:50:54Z

Unfortunately, I no longer get H200 machines from our cluster. Here is the result on H100 after #3545 and #3547:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name

 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     33.8           136319          1  136319.0  136319.0    136319    136319          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     22.7            91487          1   91487.0   91487.0     91487     91487          0.0  nvjet_hsh_128x256_64x4_2x1_v_bz_coopA_NTN

nvFuser/cuBLAS: 67%

zasdfgbnm mentioned this issue Oct 25, 2024

Tracking perf optimization of HopperMatmulTest.HSH_NT_128BSwizzle for problem size (M=16384, N=16384, K=1024), CTA tile size (64, 256) #3137

Closed

zasdfgbnm added Matmuls H100 Perf improve performance on H100 labels Oct 25, 2024

zasdfgbnm mentioned this issue Oct 25, 2024

Change shape of HSH_NT_128BSwizzle #3281

Merged

kevinstephano assigned zasdfgbnm Oct 30, 2024

zasdfgbnm mentioned this issue Nov 17, 2024

Add support for stmatrix in the unit test HopperMatmulTest/HSH_NT_128BSwizzle #3411

Merged

zasdfgbnm mentioned this issue Dec 9, 2024

Unroll warp-specialized loops #3547

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tracking perf optimization of `HopperMatmulTest.HSH_NT_128BSwizzle` for problem size `(M=2048, N=2048, K=8192)`, CTA tile size `(128, 256)` #3279

Tracking perf optimization of `HopperMatmulTest.HSH_NT_128BSwizzle` for problem size `(M=2048, N=2048, K=8192)`, CTA tile size `(128, 256)` #3279

zasdfgbnm commented Oct 25, 2024 •

edited

Loading

zasdfgbnm commented Oct 25, 2024 •

edited

Loading

zasdfgbnm commented Oct 28, 2024 •

edited

Loading

zasdfgbnm commented Oct 29, 2024 •

edited

Loading

zasdfgbnm commented Oct 31, 2024 •

edited

Loading

zasdfgbnm commented Dec 10, 2024 •

edited

Loading

Tracking perf optimization of HopperMatmulTest.HSH_NT_128BSwizzle for problem size (M=2048, N=2048, K=8192), CTA tile size (128, 256) #3279

Tracking perf optimization of HopperMatmulTest.HSH_NT_128BSwizzle for problem size (M=2048, N=2048, K=8192), CTA tile size (128, 256) #3279

Comments

zasdfgbnm commented Oct 25, 2024 • edited Loading

zasdfgbnm commented Oct 25, 2024 • edited Loading

zasdfgbnm commented Oct 28, 2024 • edited Loading

zasdfgbnm commented Oct 29, 2024 • edited Loading

zasdfgbnm commented Oct 31, 2024 • edited Loading

zasdfgbnm commented Dec 10, 2024 • edited Loading

Tracking perf optimization of `HopperMatmulTest.HSH_NT_128BSwizzle` for problem size `(M=2048, N=2048, K=8192)`, CTA tile size `(128, 256)` #3279

Tracking perf optimization of `HopperMatmulTest.HSH_NT_128BSwizzle` for problem size `(M=2048, N=2048, K=8192)`, CTA tile size `(128, 256)` #3279

zasdfgbnm commented Oct 25, 2024 •

edited

Loading

zasdfgbnm commented Oct 25, 2024 •

edited

Loading

zasdfgbnm commented Oct 28, 2024 •

edited

Loading

zasdfgbnm commented Oct 29, 2024 •

edited

Loading

zasdfgbnm commented Oct 31, 2024 •

edited

Loading

zasdfgbnm commented Dec 10, 2024 •

edited

Loading