computation performs slower than cpu version in benchmark #246

viirya · 2022-10-20T18:15:55Z

Hi, I'm benchmarking the compute code from metal-rs against a CPU version.

Specifically, I benchmark the compute example, which performs a sum, against a CPU version that simply loops over the input slice and sums it.
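For reference, a minimal sketch of what such a CPU baseline might look like. The function name `sum_cpu` and the `u32` element type are assumptions; the benchmark code itself is not shown in this thread.

```rust
/// CPU baseline: walk the input slice and accumulate.
/// The name and the `u32` element type are assumptions, not taken from
/// the benchmark in this issue.
fn sum_cpu(data: &[u32]) -> u32 {
    let mut acc = 0u32;
    for &x in data {
        // wrapping_add avoids an overflow panic in debug builds.
        acc = acc.wrapping_add(x);
    }
    acc
}
```

A loop of this shape is a simple sequential reduction, which optimizing compilers can often vectorize, so it can be a strong baseline for a small input.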

I set the input size to 1024 * factor. In every case, the metal-rs compute version performs worse than the CPU version. E.g.,

sum (metal), factor: 90 time:   [465.91 µs 479.72 µs 495.96 µs]                                      
                        change: [-2.5272% +1.6099% +6.1871%] (p = 0.46 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

sum (cpu), factor: 90   time:   [32.132 µs 32.140 µs 32.148 µs]

Is this benchmark result expected? I assumed the Metal version would speed up the operation and be faster.

Do you have any ideas or suggestions?


grovesNL · 2022-10-20T19:16:07Z

You might want to look at benchmarking only the actual compute operation (not device initialization, copying data into buffers, etc.). You generally want to reuse as many GPU resources as possible so this might be impacting your benchmarks if they're not omitted.
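One way to keep setup out of the measurement, sketched with only the standard library. The helper `time_reusing_setup` is hypothetical; in a real benchmark the `setup` closure would create the device, pipeline, and buffers, and `run` would encode and submit the compute pass.

```rust
use std::time::{Duration, Instant};

/// Runs `setup` once, outside the timed region, then times `iters` calls of
/// `run` against the reused state. Returns the mean time per call and the
/// last result. This helper is a hypothetical sketch, not part of metal-rs.
fn time_reusing_setup<S, T>(
    setup: impl FnOnce() -> S,
    mut run: impl FnMut(&mut S) -> T,
    iters: u32,
) -> (Duration, T) {
    assert!(iters > 0, "need at least one iteration");
    let mut state = setup(); // e.g. device, queue, pipeline, buffers
    let mut last = run(&mut state); // warm-up call, not timed
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from discarding the work.
        last = std::hint::black_box(run(&mut state));
    }
    (start.elapsed() / iters, last)
}
```

With Criterion, the same effect comes from building the reusable state before calling `b.iter`, so that only the closure passed to `iter` is measured.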

Even then depending on the size, it still might not beat the CPU version. It really depends on the exact kinds of computations you're doing. For sum operations specifically you might look into how to perform prefix sums on the GPU, then compare against prefix sums on the CPU (e.g., using SIMD).
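To make the prefix-sum suggestion concrete, here is a CPU-side sketch of the three-phase blocked scan that GPU implementations are commonly organized around. It is not Metal code; the function names and the `u32` element type are assumptions, and each block is processed sequentially here where a GPU would assign one threadgroup per block.

```rust
/// Sequential inclusive prefix sum, in place.
fn scan_inclusive(data: &mut [u32]) {
    let mut acc = 0u32;
    for x in data.iter_mut() {
        acc = acc.wrapping_add(*x);
        *x = acc;
    }
}

/// Blocked inclusive scan in three phases (hypothetical sketch).
fn scan_blocked(data: &mut [u32], block: usize) {
    assert!(block > 0, "block size must be non-zero");
    // Phase 1: scan each block independently and record its total.
    let mut totals: Vec<u32> = data
        .chunks_mut(block)
        .map(|c| {
            scan_inclusive(c);
            *c.last().unwrap() // chunks are never empty
        })
        .collect();
    // Phase 2: scan the per-block totals.
    scan_inclusive(&mut totals);
    // Phase 3: add the total of all preceding blocks to each later block.
    for (c, &offset) in data.chunks_mut(block).skip(1).zip(totals.iter()) {
        for x in c {
            *x = x.wrapping_add(offset);
        }
    }
}
```

The last element of an inclusive scan is the total, so a scan also yields the sum. Phases 1 and 3 are independent per block, which is what makes the structure parallelizable.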

viirya · 2022-10-21T20:36:31Z

Since the GPU uses a unified memory model, I assumed the benchmark does not incur the cost of copying data into buffers.

I revamped the benchmark to reuse the initialized device across all runs. The good news is that GPU runs improve by about 30%, but the GPU is still slower than the CPU by a wide margin.

sum (metal), factor: 90 time:   [295.16 µs 301.82 µs 308.61 µs]                                                    
                        change: [-37.171% -34.824% -32.562%] (p = 0.00 < 0.05)                                     
                        Performance has improved.                                                                  
                                                                                                                   
sum (cpu), factor: 90   time:   [28.919 µs 28.923 µs 28.927 µs]                                                    
                        change: [-0.2977% -0.0365% +0.1904%] (p = 0.80 > 0.05)                                     
                        No change in performance detected.

So I guess you're right that the bottleneck is the sum computation itself. I'm looking at prefix sum algorithms on the GPU to see whether they can improve performance further.

Congyuwang · 2022-12-13T04:52:19Z

You will still have to copy the data into a buffer, since it needs to be properly aligned.
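A sketch of the alignment condition, assuming the requirement meant here is the one Metal places on no-copy buffers: both the start address and the byte length must be page-aligned. The helper name is hypothetical, and the page size is passed in as a parameter because it differs between machines.

```rust
/// Returns true if `data` could be wrapped without a copy under the assumed
/// rule: the start address and the byte length are both multiples of
/// `page_size`. Hypothetical helper, not part of metal-rs.
fn can_wrap_without_copy<T>(data: &[T], page_size: usize) -> bool {
    assert!(page_size.is_power_of_two(), "page size must be a power of two");
    let addr = data.as_ptr() as usize;
    let bytes = std::mem::size_of_val(data);
    bytes > 0 && addr % page_size == 0 && bytes % page_size == 0
}
```

An ordinary `Vec<u32>` carries no page-alignment guarantee, so under this assumption a check like this would typically fail for it and the data would have to be copied into a Metal-allocated buffer.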

grovesNL added the question label Oct 20, 2022
