Tracking perf optimization of HopperMatmulTest.HSH_NT_128BSwizzle for problem size (M=2048, N=2048, K=8192), CTA tile size (128, 256) #3279

Open
zasdfgbnm opened this issue Oct 25, 2024 · 5 comments
Labels: H100 Perf (improve performance on H100 Matmuls)

Comments

Collaborator

zasdfgbnm commented Oct 25, 2024

The CTA tile size (128, 256) can achieve high math throughput relatively easily. The problem size was carefully selected so that the grid is one full wave. I believe this makes a good incremental task.
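As a rough sanity check on the "one full wave" statement, the CTA count can be worked out from the sizes in the title. This is only a sketch: the SM count of 132 (H100 SXM) and the one-CTA-per-SM occupancy are assumptions that are not stated in this issue.

```python
import math

# Problem size and CTA tile size, taken from the issue title.
M, N, K = 2048, 2048, 8192
CTA_M, CTA_N = 128, 256

# ASSUMPTION: 132 SMs (H100 SXM) and one resident CTA per SM.
NUM_SMS = 132

tiles_m = math.ceil(M / CTA_M)   # CTAs along M
tiles_n = math.ceil(N / CTA_N)   # CTAs along N
num_ctas = tiles_m * tiles_n
num_waves = math.ceil(num_ctas / NUM_SMS)

print(f"CTA grid: {tiles_m} x {tiles_n} = {num_ctas} CTAs")
print(f"waves needed on {NUM_SMS} SMs: {num_waves}")
```

Under these assumptions the whole output is covered by a single wave of CTAs, so there is no partially filled tail wave to distort the comparison.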

Benchmark command:

nsys nvprof ./bin/test_matmul --gtest_filter=HopperMatmulTest.HSH_NT_128BSwizzle

Current perf on H100 on main as in the latest comment:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     33.8           136319          1  136319.0  136319.0    136319    136319          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     22.7            91487          1   91487.0   91487.0     91487     91487          0.0  nvjet_hsh_128x256_64x4_2x1_v_bz_coopA_NTN

nvFuser/cuBLAS: 67%
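For reference, the nvFuser/cuBLAS figure used throughout this thread appears to be a performance ratio, that is, the cuBLAS kernel time divided by the nvFuser kernel time. A minimal sketch of that arithmetic, using the two "Total Time (ns)" values from the table above (the helper name `perf_ratio` is hypothetical):

```python
def perf_ratio(nvfuser_ns: int, cublas_ns: int) -> float:
    """Relative performance of nvFuser vs. cuBLAS (1.0 means parity)."""
    # Performance is inversely proportional to kernel time,
    # so nvFuser/cuBLAS performance = cuBLAS time / nvFuser time.
    return cublas_ns / nvfuser_ns

ratio = perf_ratio(nvfuser_ns=136319, cublas_ns=91487)
print(f"nvFuser/cuBLAS: {ratio:.1%}")
```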

Collaborator Author

zasdfgbnm commented Oct 25, 2024

On H200:

Initial perf as measured in #3281:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     43.2           205150          1  205150.0  205150.0    205150    205150          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     18.5            87550          1   87550.0   87550.0     87550     87550          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT

nvFuser/cuBLAS = 42.7%

zasdfgbnm added a commit that referenced this issue Oct 26, 2024
This shape makes more sense:
#3137 (comment),
#3279

Perf:
```
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     43.2           205150          1  205150.0  205150.0    205150    205150          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     18.5            87550          1   87550.0   87550.0     87550     87550          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT
```

nvFuser/cuBLAS = `42.7%`
Collaborator Author

zasdfgbnm commented Oct 28, 2024

On H200:

There is a perf regression after the fix of elect-sync:
#3295

Perf:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     47.8           247326          1  247326.0  247326.0    247326    247326          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     17.0            88191          1   88191.0   88191.0     88191     88191          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT

Perf nvFuser/cuBLAS: 35.7%

Collaborator Author

zasdfgbnm commented Oct 29, 2024

On H200:

After #3294:

Perf:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     39.0           172735          1  172735.0  172735.0    172735    172735          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     20.0            88768          1   88768.0   88768.0     88768     88768          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT

Perf nvFuser/cuBLAS: 51.4%.

Collaborator Author

zasdfgbnm commented Oct 31, 2024

On H200:

After #3314

Perf:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     36.0           151775          1  151775.0  151775.0    151775    151775          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     20.7            87135          1   87135.0   87135.0     87135     87135          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT

nvFuser/cuBLAS = 57.4%.
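To see the H200 progression at a glance, the kernel times from the four H200 tables in this thread can be tabulated together. This is a sketch for bookkeeping only: the run labels are informal descriptions chosen here, and the numbers are copied from the "Total Time (ns)" columns above.

```python
# (label, nvFuser kernel ns, cuBLAS kernel ns), copied from the H200 tables in this thread.
h200_runs = [
    ("initial (#3281)", 205150, 87550),
    ("after #3295", 247326, 88191),
    ("after #3294", 172735, 88768),
    ("after #3314", 151775, 87135),
]

baseline_ns = h200_runs[0][1]
summary = []
for label, nvfuser_ns, cublas_ns in h200_runs:
    ratio = cublas_ns / nvfuser_ns      # nvFuser/cuBLAS performance ratio
    speedup = baseline_ns / nvfuser_ns  # nvFuser kernel speedup vs. the initial run
    summary.append((label, ratio, speedup))
    print(f"{label:16s} nvFuser/cuBLAS = {ratio:6.1%}  vs. initial = {speedup:.2f}x")
```

The cuBLAS time stays within roughly 87 to 89 µs across runs, so the changes in the ratio come almost entirely from the nvFuser kernel.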

zasdfgbnm added a commit that referenced this issue Dec 10, 2024
When used with #3545, this
contributes a speedup equal to 5% of cuBLAS performance!

Perf together with #3545 on H100:

```
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     33.8           136319          1  136319.0  136319.0    136319    136319          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     22.7            91487          1   91487.0   91487.0     91487     91487          0.0  nvjet_hsh_128x256_64x4_2x1_v_bz_coopA_NTN
```

nvFuser/cuBLAS: 67%

Note that the above test was run with the smem epilogue disabled. I will
run a test with everything combined later. Also note that this number is
from an H100, unlike the H200 numbers in
#3279.
Collaborator Author

zasdfgbnm commented Dec 10, 2024

Unfortunately, I can no longer get H200 machines from our cluster. Here is the result on H100 after #3545 and #3547:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     33.8           136319          1  136319.0  136319.0    136319    136319          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     22.7            91487          1   91487.0   91487.0     91487     91487          0.0  nvjet_hsh_128x256_64x4_2x1_v_bz_coopA_NTN

nvFuser/cuBLAS: 67%
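The kernel times can also be converted into achieved math throughput, which may be easier to compare against hardware limits than the raw ratio. A small sketch, assuming the usual convention of 2·M·N·K floating-point operations for a matmul (this convention and the helper name `tflops` are assumptions, not something stated in the issue):

```python
# Problem size from the issue title.
M, N, K = 2048, 2048, 8192

# ASSUMPTION: count one multiply and one add per (m, n, k) triple.
flops = 2 * M * N * K

def tflops(kernel_ns: int) -> float:
    """Achieved throughput in TFLOP/s for a kernel that took kernel_ns nanoseconds."""
    seconds = kernel_ns * 1e-9
    return flops / seconds / 1e12

nvfuser_tflops = tflops(136319)  # nvFuser kernel time from the H100 table above
cublas_tflops = tflops(91487)    # cuBLAS kernel time from the H100 table above

print(f"nvFuser: {nvfuser_tflops:.0f} TFLOP/s")
print(f"cuBLAS:  {cublas_tflops:.0f} TFLOP/s")
```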
