cuSPARSELt matmul example not working on M=N=K=8192 #203

Open
OrenLeung opened this issue Jul 30, 2024 · 12 comments

@OrenLeung

On https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSELt/matmul, the example runs fine with the existing small m, n, k values, but when I change m, n, k to 8192 I get a runtime error. Any pointers or patches on how to fix it?

CUSPARSE API failed at line 191 with error: operation not supported (10)
https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuSPARSELt/matmul/matmul_example.cpp#L116-L118
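
For reference, this is the only code change in question, a minimal sketch assuming the problem size is defined as plain integer constants near the top of matmul_example.cpp (as in the shipped sample):

// Sketch of the only modification: the GEMM problem size in matmul_example.cpp.
// Everything else in the sample is left untouched.
constexpr int m = 8192;
constexpr int n = 8192;
constexpr int k = 8192;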

fbusato commented Jul 30, 2024

@OrenLeung a couple of questions to better understand your issue.

  • Are you using the latest cuSPARSELt version?
  • What CUDA version are you using? What OS?
  • Did you change anything else in the code?
  • Any chance you can run compute-sanitizer and valgrind to check that all operations are valid on your system?

OrenLeung commented Jul 30, 2024

Hi @fbusato, thanks for the quick reply.

I didn't change anything else in the code, just the m, n, k variables. I was able to compile and run the matmul example with the default m, n, k values.

  • I am running CUDA 12.5 with the 550 driver
  • I am on Ubuntu 22.04
  • I am on an HGX H100 SXM system; the chassis manufacturer is Dell
  • I installed the latest cuSPARSELt:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install libcusparselt0 libcusparselt-dev

I have double-checked that the cuSPARSE and cuSPARSELt shared libraries are under my CUDA home:

ls /usr/local/cuda/lib64/libcusparse*
libcusparse.so            libcusparseLt.so          libcusparseLt_static.a
libcusparse.so.12         libcusparseLt.so.0        
libcusparse.so.12.5.1.3   libcusparseLt.so.0.5.2.1
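
A quick way to confirm which cuSPARSELt build is actually loaded at runtime (a minimal sketch, assuming the cusparseLtGetVersion library-management call; compile with nvcc and link against -lcusparseLt):

#include <cusparseLt.h>
#include <cstdio>

int main() {
    cusparseLtHandle_t handle;
    cusparseLtInit(&handle);

    int version = 0;
    // Reports the version of the libcusparseLt.so that the loader actually
    // picked up, which can differ from what `ls` shows if another copy is
    // found first on the library path.
    cusparseLtGetVersion(&handle, &version);
    std::printf("loaded cuSPARSELt version: %d\n", version);

    cusparseLtDestroy(&handle);
    return 0;
}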

fbusato commented Jul 30, 2024

It seems that you are using cuSPARSELt 0.5.2.1 (libcusparseLt.so.0.5.2.1), which doesn't support Hopper; see the release notes: https://docs.nvidia.com/cuda/cusparselt/release_notes.html
My suggestion is to manually download and install the latest version from https://developer.nvidia.com/cusparselt-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local

@OrenLeung

Hi @fbusato

Thanks for the suggestion. I have now correctly symlinked to cuSPARSELt v0.6.2 as you suggested, and I have verified that the provided m, n, k in the example works properly and does not deadlock.

But unfortunately, for m=n=k=8192 it appears to deadlock; it seems to be stuck in a half-to-float conversion, __internal_half2float. Strange.

I have also double-checked that m, n, k is the only thing I changed.

[screenshots: stack traces showing time spent in __internal_half2float]

fbusato commented Jul 31, 2024

Hi @OrenLeung, the 'deadlock' you observe is due to the long computation time of the host-side correctness check for large matrices. If you want to speed up the process, my suggestion is to use cuBLAS to compute the reference matrix multiplication on the GPU.
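
For example, a minimal sketch of a GPU-side reference GEMM with cuBLAS (fp16 inputs, fp32 accumulation); dA, dB, and dC_ref here are placeholders for the device buffers the sample already allocates, and the layout/transpose settings would need to match those chosen in matmul_example.cpp:

#include <cublas_v2.h>
#include <cuda_fp16.h>

// Compute C_ref = A * B on the GPU instead of the host-side loop.
// A is m x k, B is k x n, C_ref is m x n; this sketch assumes column-major
// layout with no transposes -- adjust opA/opB and the leading dimensions to
// match the layouts used in the sample.
void reference_gemm(const __half* dA, const __half* dB, __half* dC_ref,
                    int m, int n, int k) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f;
    const float beta  = 0.0f;
    // fp16 inputs/outputs with fp32 accumulation, matching tensor-core GEMMs.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 dA, CUDA_R_16F, m,
                 dB, CUDA_R_16F, k,
                 &beta,
                 dC_ref, CUDA_R_16F, m,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);

    cublasDestroy(handle);
}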

OrenLeung commented Jul 31, 2024

Hi @fbusato,

Thanks for your suggestion! I have now got it working, but unfortunately the realized TFLOP/s is nowhere close to the peak theoretical sparse TFLOP/s. Do you have any tips on how to improve the cuSPARSELt performance?

Realized sparse cuSPARSELt fp16: 1005 TFLOP/s out of a theoretical peak of 1,979
Realized dense cuBLAS fp16: 870 TFLOP/s out of a theoretical peak of 989.5

This means there is only around a 15% realized improvement (1005 / 870 ≈ 1.16x). Although no one was expecting the claimed 2x improvement, one would expect closer to a 40-50% realized improvement. On A100, NVIDIA claims that the speedup for big GEMMs is 1.6-1.8x: https://developer.nvidia.com/blog/exploiting-ampere-structured-sparsity-with-cusparselt/

Attached is my script for benchmarking 8192x8192x8192 cuSPARSELt 2:4 semi-structured fp16 sparse GEMMs vs. cuBLAS fp16 dense GEMMs on H100. I have ensured that I am measuring GPU time through CUDA events and that I am on the latest cuSPARSELt version.
OrenLeung@e3cfb07
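
For reference, the measurement in my script follows this pattern (a sketch; run_gemm below is a placeholder for the cusparseLtMatmul or cublasGemmEx launch being timed):

#include <cuda_runtime.h>
#include <cstdio>

// Time `iters` GEMM launches with CUDA events and report TFLOP/s.
// run_gemm() stands in for the sparse or dense GEMM call.
template <typename F>
void benchmark(F run_gemm, int m, int n, int k, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < 100; ++i) run_gemm();   // warm-up iterations

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) run_gemm();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // 2*m*n*k FLOPs per GEMM; for m=n=k=8192 that is ~1.1e12 FLOPs,
    // so ~1.09 ms per GEMM corresponds to ~1000 TFLOP/s.
    const double flops  = 2.0 * m * n * k * iters;
    const double tflops = flops / (ms * 1e-3) / 1e12;
    std::printf("avg %.3f ms/iter, %.1f TFLOP/s\n", ms / iters, tflops);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}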

fbusato commented Jul 31, 2024

There are several things to consider when benchmarking cuSPARSELt. First, you should use Nsight Systems (or CUPTI) to get more reliable time measurements. Second, you need to run the autotuning functionality; see the other example. Other points to consider: run some warm-up iterations, lock the GPU SM/memory clocks, disable autoboost, ensure there is no power/thermal throttling, disable CPU turbo boost, set the CPU governor to performance, etc.
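
For reference, the autotuning step follows the call pattern of the matmul sample, roughly as sketched below (handle, plan, dA_compressed, d_workspace, and streams are assumed to be set up exactly as in matmul_example.cpp):

// Autotune: cusparseLtMatmulSearch runs the multiplication with the available
// kernel configurations and stores the fastest one inside the plan, so the
// subsequent cusparseLtMatmul calls (the ones being timed) reuse it.
CHECK_CUSPARSE( cusparseLtMatmulSearch(&handle, &plan, &alpha,
                                       dA_compressed, dB, &beta,
                                       dC, dD, d_workspace,
                                       streams, num_streams) )

// Timed region: only cusparseLtMatmul should be inside the measurement loop.
CHECK_CUSPARSE( cusparseLtMatmul(&handle, &plan, &alpha,
                                 dA_compressed, dB, &beta,
                                 dC, dD, d_workspace,
                                 streams, num_streams) )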

@OrenLeung

Hi @fbusato,

Thanks for your suggestions.

  1. I believe I am already running the autotuning function cusparseLtMatmulSearch. Is there another function that I am missing? https://github.com/OrenLeung/CUDALibrarySamples/blob/e3cfb07e6b6625ec33b8526d82bebd5a21185624/cuSPARSELt/matmul/matmul_example.cpp#L348
  2. I have already locked the GPU clock speed: sudo nvidia-smi -i 0 --lock-gpu-clocks=1830,1830
  3. As you may be aware, due to throttling, the TFLOP/s gets worse over time (I have included a sleep between the sparse and dense benchmarks to allow the GPU to cool down). Even with warm-up, the perf delta is still around 15%.


@OrenLeung

It seems that when changing the inputs to a normal distribution centered around 0, the sparse performance gets a bit better, with about a 20% improvement over dense: OrenLeung@9cabba4

# median of 5000 iterations, discarding the first 100 warm-up iterations
Dense Median: 642.971 TFLOP/s
Sparse Median: 768.348 TFLOP/s
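
The input change is roughly the following (a sketch; std::vector<__half> is a stand-in for however the sample allocates the host buffers, and only the distribution is the point here):

#include <cuda_fp16.h>
#include <random>
#include <vector>

// Fill a host input buffer with values drawn from N(0, 1) instead of the
// sample's original initialization; the buffer is then copied to the device
// as usual before pruning/compression.
void fill_normal(std::vector<__half>& host_buf, unsigned seed) {
    std::mt19937 gen(seed);
    std::normal_distribution<float> dist(0.0f, 1.0f);  // centered around 0
    for (auto& v : host_buf)
        v = __float2half(dist(gen));
}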


fbusato commented Jul 31, 2024

@OrenLeung we evaluated the same sparse GEMM operation on our systems at default clocks. We observed a 1.38x speedup (sparse vs. dense) on an H100 350W and 1.22x on an H100 800W.

@OrenLeung

@fbusato thanks for running it. By "800W H100", you mean 700W, right? We also see around a 1.20-1.22x improvement.

Would you have any suggestions on shapes where sparsity would show the biggest gain compared to dense?

fbusato commented Jul 31, 2024

I don't have any specific suggestions other than to try different shapes and data types. The results are affected by the GPU model, clock settings, and CUDA version, so it is hard to give exact sizes. The main engineer is out of the office and will be back in two weeks; he can help you better.
