Draft: Catch2 Benchmarking #1723
base: develop

Conversation
#if defined(ALPAKA_ACC_GPU_CUDA_ENABLED) && !BOOST_LANG_CUDA
#    error If ALPAKA_ACC_GPU_CUDA_ENABLED is set, the compiler has to support CUDA!
#endif

#if defined(ALPAKA_ACC_GPU_HIP_ENABLED) && !BOOST_LANG_HIP
#    error If ALPAKA_ACC_GPU_HIP_ENABLED is set, the compiler has to support HIP!
#endif
I dislike those. Can't we just have a prelude in alpaka.hpp, after BoostPredef, that checks those in one place?
As long as it takes ALPAKA_HOST_ONLY into account.
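A minimal sketch of what such a prelude could look like, assuming a single header included from alpaka.hpp right after the Boost.Predef include; the file name and the exact shape of the ALPAKA_HOST_ONLY guard are assumptions for illustration, not the actual implementation:

// PreludeChecks.hpp (hypothetical) - included from alpaka.hpp after <boost/predef.h>.
// When ALPAKA_HOST_ONLY is defined, the translation unit is deliberately compiled
// by a plain host compiler, so the device-language checks must be skipped.
#if !defined(ALPAKA_HOST_ONLY)

#    if defined(ALPAKA_ACC_GPU_CUDA_ENABLED) && !BOOST_LANG_CUDA
#        error If ALPAKA_ACC_GPU_CUDA_ENABLED is set, the compiler has to support CUDA!
#    endif

#    if defined(ALPAKA_ACC_GPU_HIP_ENABLED) && !BOOST_LANG_HIP
#        error If ALPAKA_ACC_GPU_HIP_ENABLED is set, the compiler has to support HIP!
#    endif

#endif // !defined(ALPAKA_HOST_ONLY)

This would let the individual headers drop their per-file checks while keeping the error messages unchanged.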
namespace alpaka::test
{
    //! The fixture for executing a kernel on a given accelerator.
Suggested change:
- //! The fixture for executing a kernel on a given accelerator.
+ //! The fixture for benchmarking the execution of a kernel on a given accelerator.
About the fixture - I don't think we can provide a universal benchmark fixture as we discussed earlier, i.e. one that would execute the kernel and pass in some pre-allocated buffers which were set up in the user's benchmark cpp code. The issue is two-fold:
Are you still working on this @sliwowitz?
Yes. I got stuck on the
I checked the output options again. Last time we had the problem that the output was not machine readable, but I found some documentation about the usage of the --reporter option. I tested your benchmark with:

$ build/ninja-omp2b-gcc-release/test/benchmark/rand/randBenchmark --reporter XML
<?xml version="1.0" encoding="UTF-8"?>
<Catch2TestRun name="randBenchmark" rng-seed="645286256" xml-format-version="2" catch2-version="3.3.2">
<TestCase name="defaultRandomGeneratorBenchmark" tags="[randBenchmark]" filename="/home/simeon/projects/alpaka/test/benchmark/rand/src/randBenchmark.cpp" line="53">
<BenchmarkResults name="Random sequence N=10" samples="100" resamples="100000" iterations="1" clockResolution="32.4883" estimatedDuration="8.6125e+06">
<!-- All values in nano seconds -->
<mean value="89822.5" lowerBound="85849.8" upperBound="103189" ci="0.95"/>
<standardDeviation value="33361.6" lowerBound="10991" upperBound="75389.8" ci="0.95"/>
<outliers variance="0.98889" lowMild="2" lowSevere="0" highMild="2" highSevere="2"/>
</BenchmarkResults>
<BenchmarkResults name="Random sequence N=100000" samples="100" resamples="100000" iterations="1" clockResolution="32.4883" estimatedDuration="7.2092e+06">
<!-- All values in nano seconds -->
<mean value="131106" lowerBound="97376" upperBound="287445" ci="0.95"/>
<standardDeviation value="317164" lowerBound="12744.5" upperBound="753666" ci="0.95"/>
<outliers variance="0.989974" lowMild="0" lowSevere="0" highMild="0" highSevere="2"/>
</BenchmarkResults>
<BenchmarkResults name="Random sequence N=1000000" samples="100" resamples="100000" iterations="1" clockResolution="32.4883" estimatedDuration="1.53628e+07">
<!-- All values in nano seconds -->
<mean value="229560" lowerBound="223253" upperBound="240870" ci="0.95"/>
<standardDeviation value="41958.1" lowerBound="25405.3" upperBound="79203.3" ci="0.95"/>
<outliers variance="0.935867" lowMild="11" lowSevere="0" highMild="0" highSevere="1"/>
</BenchmarkResults>
<BenchmarkResults name="Random sequence N=10000000" samples="100" resamples="100000" iterations="1" clockResolution="32.4883" estimatedDuration="1.02668e+08">
<!-- All values in nano seconds -->
<mean value="1.57844e+06" lowerBound="1.32217e+06" upperBound="2.17723e+06" ci="0.95"/>
<standardDeviation value="1.87999e+06" lowerBound="702312" upperBound="3.27425e+06" ci="0.95"/>
<outliers variance="0.989892" lowMild="0" lowSevere="0" highMild="1" highSevere="3"/>
</BenchmarkResults>
<BenchmarkResults name="Random sequence N=100000000" samples="100" resamples="100000" iterations="1" clockResolution="32.4883" estimatedDuration="1.00224e+09">
<!-- All values in nano seconds -->
<mean value="1.02198e+07" lowerBound="1.01508e+07" upperBound="1.03973e+07" ci="0.95"/>
<standardDeviation value="515800" lowerBound="116904" upperBound="994951" ci="0.95"/>
<outliers variance="0.484665" lowMild="2" lowSevere="0" highMild="1" highSevere="2"/>
</BenchmarkResults>
<BenchmarkResults name="Random sequence N=1000000000" samples="100" resamples="100000" iterations="1" clockResolution="32.4883" estimatedDuration="1.10758e+10">
<!-- All values in nano seconds -->
<mean value="1.04739e+08" lowerBound="1.03648e+08" upperBound="1.06501e+08" ci="0.95"/>
<standardDeviation value="6.91494e+06" lowerBound="4.89068e+06" upperBound="9.9287e+06" ci="0.95"/>
<outliers variance="0.625317" lowMild="2" lowSevere="0" highMild="0" highSevere="19"/>
</BenchmarkResults>
<OverallResult success="true" skips="0">
<StdOut>
Hardware threads: 64
temp debug normalized result = 18.7131 should probably converge to 0.5.Hardware threads: 64
temp debug normalized result = 18.7981 should probably converge to 0.5.Hardware threads: 64
temp debug normalized result = 9.672 should probably converge to 0.5.Hardware threads: 64
temp debug normalized result = 1.64295 should probably converge to 0.5.Hardware threads: 64
temp debug normalized result = 0.623814 should probably converge to 0.5.Hardware threads: 64
temp debug normalized result = 0.500023 should probably converge to 0.5.
</StdOut>
</OverallResult>
</TestCase>
<OverallResults successes="6" failures="0" expectedFailures="0" skips="0"/>
<OverallResultsCases successes="1" failures="0" expectedFailures="0" skips="0"/>
</Catch2TestRun>

There is also a JSON reporter, but for that we need to update Catch2 (only a new minor version): catchorg/Catch2#2706
I'd vote for the JSON reporter as it could make the output both machine- and human-readable :-)
In general, I also prefer JSON because it is more readable. But we should at least do a short test of whether XML and JSON provide the same amount of information. For example, the XML output uses comments to store the information that the times were measured in nanoseconds.
The JSON reporter is currently not working: it does not contain the benchmark results. The reporter is currently experimental and not fully implemented.
This is an example of using Catch2 facilities for benchmarking.
Putting this into Draft mode, since it's still WIP. It compiles and runs, but returns a wrong result, and probably also measures things we don't really want to measure; still, I want this out so others can share their comments.
I had to create another fixture for the benchmarks, based on the earlier KernelExecutionFixture. I thought about inheritance; it didn't work out for me on the first try, but maybe there's a way.

One catch with Catch2 benchmarks is that internally the BENCHMARK-marked code is run many times: first to estimate the runtime, then to collect enough data for meaningful statistics (this is what Catch2 calls iterations, and it can't be changed without modifying the Catch2 sources). This is why my KernelExecutionBenchmarkFixture first sets the memory up (a potentially lengthy operation, depending on what we want to measure in the next step) outside the BENCHMARK area. Inside the BENCHMARK, the memory is cleared/memset/whatever, because that part will be re-run multiple times. After resetting the memory, there is a meter.measure([&]{...}); call which encapsulates the part of the BENCHMARK that is actually to be measured; see the sketch below.
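A minimal sketch of this setup/reset/measure split in plain Catch2 (assuming the executable is linked against Catch2WithMain); the buffer and the loop standing in for the kernel launch are illustrative placeholders, not the actual fixture code:

#include <catch2/benchmark/catch_benchmark.hpp>
#include <catch2/catch_test_macros.hpp>

#include <algorithm>
#include <vector>

TEST_CASE("kernelBenchmarkSketch", "[benchmark]")
{
    // Potentially lengthy setup, done once outside the BENCHMARK: allocate buffers.
    std::vector<float> buffer(1'000'000);

    BENCHMARK_ADVANCED("run kernel")(Catch::Benchmark::Chronometer meter)
    {
        // This part is re-run for every sample: reset the memory to a known state.
        std::fill(buffer.begin(), buffer.end(), 0.0f);

        // Only the lambda passed to meter.measure() is actually timed.
        meter.measure(
            [&]
            {
                // Stand-in for the kernel execution we want to benchmark.
                for(auto& x : buffer)
                    x += 1.0f;
            });
    };
}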
You can build the benchmarks with alpaka_BUILD_BENCHMARK=ON. The executable will live in test/benchmark/rand/randBenchmark. If you run it, it will collect 100 samples, that is, it will run each benchmark 100*i times, where i is the number of iterations auto-estimated by Catch2 (it should be somewhere between 1 and 3). If you just want to see whether the benchmarks run, you can pass a parameter on the command line: test/benchmark/rand/randBenchmark --benchmark-samples=1 (--benchmark-samples=1 is also set when running in CI).
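For reference, a full configure/build/run sequence might look like the following; the build directory name is arbitrary, and the exact executable path depends on your generator and layout:

$ cmake -S . -B build -Dalpaka_BUILD_BENCHMARK=ON
$ cmake --build build
$ build/test/benchmark/rand/randBenchmark --benchmark-samples=1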
Known issues:
- The normalized results (marked temp debug in the output), which should all be around 0.5, are actually not.