computation performs slower than cpu version in benchmark #246
Comments
You might want to benchmark only the actual compute operation, excluding device initialization, copying data into buffers, and so on. You generally want to reuse as many GPU resources as possible, so if that setup is inside the timed region it may be skewing your results. Even then, depending on the input size, the GPU still might not beat the CPU version. It really depends on the exact kind of computation you're doing. For sum operations specifically, you might look into how to perform prefix sums on the GPU, then compare against prefix sums on the CPU (e.g., using SIMD).
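To make the "time only the compute" idea concrete, here is a minimal sketch in plain Rust. The helper name `bench_compute_only` is hypothetical and not part of metal-rs; the `setup` closure stands in for whatever creates the device, pipeline, and buffers, and `compute` stands in for encoding and running the kernel.

```rust
use std::time::{Duration, Instant};

/// Runs `setup` exactly once (untimed), then times only `compute`.
/// Returns the last result and the mean duration of one `compute` call.
/// NOTE: hypothetical helper for illustration, not a metal-rs API.
fn bench_compute_only<S, R>(
    setup: impl FnOnce() -> S,
    mut compute: impl FnMut(&mut S) -> R,
    iters: u32,
) -> (R, Duration) {
    assert!(iters > 0, "need at least one timed iteration");

    // One-time work (device, pipeline, buffers) stays outside the timer.
    let mut state = setup();

    // Untimed warm-up run so first-use costs are not measured.
    let mut last = compute(&mut state);

    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from discarding the work.
        last = std::hint::black_box(compute(&mut state));
    }
    (last, start.elapsed() / iters)
}
```

Calling it with the same `compute` closure for both the CPU and GPU paths keeps the two measurements comparable, since neither one pays for setup inside the timed loop.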
Since the GPU has a unified memory model, I assumed the cost of copying data into buffers would not be a factor. I revamped the benchmark to reuse one initialized device for all runs. The good news is that GPU runs improved by about 30%, but they are still slower than the CPU at significant scale.
So I guess you're right that the sum computation itself is the issue. I'm looking at prefix sum algorithms on the GPU to see whether they can improve performance further.
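For reference, this is roughly what the two shapes of prefix sum look like. It is a CPU-only sketch, not a Metal kernel: the second function simulates the Hillis-Steele step-doubling scan, where every element in a pass could be computed by its own GPU thread. Function names and the `u64` element type are my own assumptions.

```rust
/// Sequential inclusive prefix sum: out[i] = data[0] + ... + data[i].
fn prefix_sum_sequential(data: &[u64]) -> Vec<u64> {
    let mut acc = 0u64;
    data.iter()
        .map(|&x| {
            acc += x;
            acc
        })
        .collect()
}

/// Hillis-Steele inclusive scan, simulated on the CPU.
/// Each pass of the outer loop corresponds to one parallel step in which
/// element i adds the value `stride` positions to its left.
fn prefix_sum_hillis_steele(data: &[u64]) -> Vec<u64> {
    let mut cur = data.to_vec();
    let mut next = cur.clone();
    let mut stride = 1;
    while stride < cur.len() {
        // On a GPU, every iteration of this inner loop would be one thread.
        for i in 0..cur.len() {
            next[i] = if i >= stride {
                cur[i] + cur[i - stride]
            } else {
                cur[i]
            };
        }
        // Double-buffering: the freshly written pass becomes the input.
        std::mem::swap(&mut cur, &mut next);
        stride *= 2;
    }
    cur
}
```

The scan takes about log2(n) passes but does more total additions than the sequential loop, which is one reason a GPU version only pays off once the input is large enough.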
You will still have to copy into a buffer, since the data needs to be properly aligned.
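To illustrate the alignment point: as far as I understand, Metal's no-copy buffer path expects a page-aligned pointer and a length that covers whole pages, which an ordinary `Vec` allocation usually does not satisfy. A small sketch of that check, assuming the page size is passed in by the caller (both helper names are hypothetical, and the page size should be queried at runtime rather than hard-coded):

```rust
/// Rounds `len` up to the next multiple of `page_size`.
/// `page_size` is assumed to be a power of two.
fn round_up_to_page(len: usize, page_size: usize) -> usize {
    assert!(page_size.is_power_of_two());
    (len + page_size - 1) & !(page_size - 1)
}

/// True if the slice starts on a page boundary and spans whole pages.
/// An empty slice is reported as not usable.
fn is_page_aligned<T>(data: &[T], page_size: usize) -> bool {
    assert!(page_size.is_power_of_two());
    let addr = data.as_ptr() as usize;
    let bytes = std::mem::size_of_val(data);
    bytes > 0 && addr % page_size == 0 && bytes % page_size == 0
}
```

If `is_page_aligned` returns false for your input, copying into a buffer allocated by the device is the straightforward option, and that copy belongs in the setup phase of the benchmark rather than in the timed region.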
Hi, I'm running some benchmarks comparing the compute code from metal-rs against a CPU version.
I basically benchmark the compute example, which performs a `sum` operation, against a CPU version that simply loops over the input slice and sums it up. I scale the input data size as `1024 * factor`. In all cases, the metal-rs compute performs worse than the CPU version. E.g.,
I'm wondering whether this benchmark result is expected, because I assumed the Metal version would speed up the operation.
Do you have any ideas or suggestions?
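For context, the CPU side of the comparison looks roughly like the sketch below. The function names, the `u32` element type, and the fill pattern are assumptions for illustration; only the `1024 * factor` sizing and the plain summing loop come from the description above.

```rust
/// Builds an input of `1024 * factor` elements.
/// The fill pattern (i % 7) is arbitrary and only keeps values small.
fn make_input(factor: usize) -> Vec<u32> {
    (0..1024 * factor).map(|i| (i % 7) as u32).collect()
}

/// CPU baseline: loop over the slice and accumulate.
/// wrapping_add avoids a panic on overflow in debug builds.
fn cpu_sum(data: &[u32]) -> u32 {
    let mut total = 0u32;
    for &x in data {
        total = total.wrapping_add(x);
    }
    total
}
```

A loop like this is easy for the compiler to vectorize in release builds, so it is a fairly strong baseline for a GPU kernel to beat at small sizes.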