fix bablestream benchmark #2420

mehmetyusufoglu · 2024-11-07T13:28:29Z

Some of the 5 kernels of the BabelStream benchmark were not connected to each other. With this change, each kernel works on the result of the previous one, so if one of them is changed and fails, the error is caught in the final result. (Since we don't check the results after each kernel run, this is needed to make sure all kernels are connected.)
Using the arrays in a different order when calling the kernels might affect performance due to caching (although this was not observed). Therefore, the kernel call sequence now uses the same arrays for the same kernels as the original BabelStream by UoB.
An optional kernel, NStream, is added. It can be run separately on its own.
One of the 5 BabelStream kernels, the triad kernel, could optionally be run alone in the original code by UoB. This option is also added.

This PR is an extension of previous PR: #2299

New parameters and kernel calls with specific arrays in the kernel call sequence to avoid cache usage differences:

A = 0.1, B = 0.2, C = 0.0, scalar = 0.4
C = A // copy
B = scalar * C // mult
C = A + B // add
A = B + scalar * C // triad
Missing optional kernel NStream is added

Update November 25: ALL 5 Babelstream Kernels (copy add mul dot triad) run for all backends.

RESULTS

Randomness seeded to: 3331482523
Kernels: Init, Copy, Mul, Add, Triad, Dot Kernels


AcceleratorType:AccCpuSerial<1,unsigned int>
NumberOfRuns:2
Precision:single
DataSize(items):262144
DeviceName:13th Gen Intel(R) Core(TM) i7-1360P
WorkDivInit :{gridBlockExtent: (262144), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivCopy :{gridBlockExtent: (262144), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivMult :{gridBlockExtent: (262144), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivAdd  :{gridBlockExtent: (262144), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivTriad:{gridBlockExtent: (262144), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivDot  :{gridBlockExtent: (262144), blockThreadExtent: (1), threadElemExtent: (1)}
AccToHost Memcpy Time(sec):0.00188882
Kernels         Bandwidths(GB/s) MinTime(s) MaxTime(s) AvgTime(s) DataUsage(MB) 
 AddKernel       23.732          0.00013255 0.00013255 0.00013255 3.1457 
 CopyKernel      23.401          8.9618e-05 8.9618e-05 8.9618e-05 2.0972 
 DotKernel       0.96404         0.0021754 0.0021754 0.0021754 2.0972 
 InitKernel      0.96656         0.0032545 0.0032545 0.0032545 3.1457 
 MultKernel      21.214          9.8855e-05 9.8855e-05 9.8855e-05 2.0972 
 TriadKernel     24.391          0.00012897 0.00012897 0.00012897 3.1457 

Kernels: Init, Copy, Mul, Add, Triad, Dot Kernels


AcceleratorType:AccGpuCudaRt<1,unsigned int>
NumberOfRuns:2
Precision:single
DataSize(items):262144
DeviceName:NVIDIA RTX A500 Laptop GPU
WorkDivInit :{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivCopy :{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivMult :{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivAdd  :{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivTriad:{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivDot  :{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
AccToHost Memcpy Time(sec):0.00317802
Kernels         Bandwidths(GB/s) MinTime(s) MaxTime(s) AvgTime(s) DataUsage(MB) 
 AddKernel       72.282          4.352e-05 4.352e-05 4.352e-05 3.1457 
 CopyKernel      77.294          2.7132e-05 2.7132e-05 2.7132e-05 2.0972 
 DotKernel       30.834          6.8014e-05 6.8014e-05 6.8014e-05 2.0972 
 InitKernel      11.531          0.00027281 0.00027281 0.00027281 3.1457 
 MultKernel      64.534          3.2497e-05 3.2497e-05 3.2497e-05 2.0972 
 TriadKernel     74.74           4.2089e-05 4.2089e-05 4.2089e-05 3.1457 

Kernels: Init, Copy, Mul, Add, Triad, Dot Kernels


AcceleratorType:AccCpuSerial<1,unsigned int>
NumberOfRuns:2
Precision:double
DataSize(items):262144
DeviceName:13th Gen Intel(R) Core(TM) i7-1360P
WorkDivInit :{gridBlockExtent: (262144), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivCopy :{gridBlockExtent: (262144), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivMult :{gridBlockExtent: (262144), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivAdd  :{gridBlockExtent: (262144), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivTriad:{gridBlockExtent: (262144), blockThreadExtent: (1), threadElemExtent: (1)}
WorkDivDot  :{gridBlockExtent: (262144), blockThreadExtent: (1), threadElemExtent: (1)}
AccToHost Memcpy Time(sec):0.00384645
Kernels         Bandwidths(GB/s) MinTime(s) MaxTime(s) AvgTime(s) DataUsage(MB) 
 AddKernel       16.979          0.00037054 0.00037054 0.00037054 6.2915 
 CopyKernel      15.052          0.00027866 0.00027866 0.00027866 4.1943 
 DotKernel       1.208           0.0034721 0.0034721 0.0034721 4.1943 
 InitKernel      1.7377          0.0036205 0.0036205 0.0036205 6.2915 
 MultKernel      15.806          0.00026537 0.00026537 0.00026537 4.1943 
 TriadKernel     17.379          0.000362 0.000362 0.000362 6.2915 

Kernels: Init, Copy, Mul, Add, Triad, Dot Kernels


AcceleratorType:AccGpuCudaRt<1,unsigned int>
NumberOfRuns:2
Precision:double
DataSize(items):262144
DeviceName:NVIDIA RTX A500 Laptop GPU
WorkDivInit :{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivCopy :{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivMult :{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivAdd  :{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivTriad:{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivDot  :{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
AccToHost Memcpy Time(sec):0.00536517
Kernels         Bandwidths(GB/s) MinTime(s) MaxTime(s) AvgTime(s) DataUsage(MB) 
 AddKernel       77.046          8.1658e-05 8.1658e-05 8.1658e-05 6.2915 
 CopyKernel      65.102          6.4427e-05 6.4427e-05 6.4427e-05 4.1943 
 DotKernel       33.015          0.00012704 0.00012704 0.00012704 4.1943 
 InitKernel      64.109          9.8137e-05 9.8137e-05 9.8137e-05 6.2915 
 MultKernel      74.419          5.6361e-05 5.6361e-05 5.6361e-05 4.1943 
 TriadKernel     77.08           8.1622e-05 8.1622e-05 8.1622e-05 6.2915 

===============================================================================
All tests passed (16 assertions in 4 test cases)

RUN FOR BENCHMARKING

./babelstream --array-size=33554432 --number-runs=100
Array size set to: 33554432
Number of runs provided: 100
Randomness seeded to: 1929971841
Kernels: Init, Copy, Mul, Add, Triad, Dot Kernels


AcceleratorType:AccGpuCudaRt<1,unsigned int>
NumberOfRuns:100
Precision:single
DataSize(items):33554432
DeviceName:NVIDIA RTX A500 Laptop GPU
WorkDivInit :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivCopy :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivMult :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivAdd  :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivTriad:{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivDot  :{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
AccToHost Memcpy Time(sec):0.311451
Kernels         Bandwidths(GB/s) MinTime(s) MaxTime(s) AvgTime(s) DataUsage(MB) 
 AddKernel       90.574          0.0044456 0.0044824 0.0044742 402.65 
 CopyKernel      90.291          0.002973 0.0030168 0.0030108 268.44 
 DotKernel       92.872          0.0028904 0.0029612 0.002928 268.44 
 InitKernel      92.054          0.0043741 0.0043741 0.0043741 402.65 
 MultKernel      90.464          0.0029673 0.0030106 0.0030042 268.44 
 TriadKernel     90.758          0.0044366 0.0044949 0.0044788 402.65 

Kernels: Init, Copy, Mul, Add, Triad, Dot Kernels


AcceleratorType:AccGpuCudaRt<1,unsigned int>
NumberOfRuns:100
Precision:double
DataSize(items):33554432
DeviceName:NVIDIA RTX A500 Laptop GPU
WorkDivInit :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivCopy :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivMult :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivAdd  :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivTriad:{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivDot  :{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
AccToHost Memcpy Time(sec):0.589239
Kernels         Bandwidths(GB/s) MinTime(s) MaxTime(s) AvgTime(s) DataUsage(MB) 
 AddKernel       90.494          0.008899 0.008962 0.0089358 805.31 
 CopyKernel      88.791          0.0060464 0.0061075 0.0060708 536.87 
 DotKernel       93.117          0.0057656 0.0058216 0.005797 536.87 
 InitKernel      89.423          0.0090056 0.0090056 0.0090056 805.31 
 MultKernel      89.072          0.0060274 0.0061036 0.0060612 536.87 
 TriadKernel     90.582          0.0088903 0.0089994 0.0089574 805.31 

===============================================================================
All tests passed (8 assertions in 2 test cases)

psychocoderHPC · 2024-11-12T09:16:24Z

@mehmetyusufoglu Can you please check if CPU is working too.

psychocoderHPC · 2024-11-13T08:51:33Z

benchmarks/babelstream/src/babelStreamMainTest.cpp


+            DataType const* sumPtr = std::data(bufHostSumPerBlock);
+            float const result = std::reduce(sumPtr, sumPtr + gridBlockExtent, 0.0f);


This and the memcpy have to be part of the measurement, because it is our fault that we do not execute the full reduction on the device.
For a fair comparison, the allocation of the result would have to be part of measureKernelExec too, but the upstream CUDA implementation is cheating here as well, so I assume it is fine to allocate it outside of measureKernelExec.

ok, I saw these are also part of measurement. Thanks.

done, thanks. [reduce is taken into measure]
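
A minimal sketch of what "taking the reduce into the measurement" can look like on the host side. It is not the PR's code: TimedSum and timedFinalReduce are hypothetical names, and the per-block sums are modelled as a std::vector. The sketch accumulates in the buffer's own element type; the snippet quoted above accumulates in float, which may lose precision when DataType is double.

```cpp
#include <chrono>
#include <numeric>
#include <vector>

// Result of a timed host-side reduction.
template<typename T>
struct TimedSum
{
    T sum;
    double seconds;
};

// Sums the per-block partial results on the host and reports how long the
// summation took, so the caller can add it to the dot kernel's time.
template<typename T>
TimedSum<T> timedFinalReduce(std::vector<T> const& sumPerBlock)
{
    auto const start = std::chrono::steady_clock::now();
    // T{0} keeps the accumulator in the buffer's own precision.
    T const sum = std::reduce(sumPerBlock.begin(), sumPerBlock.end(), T{0});
    auto const stop = std::chrono::steady_clock::now();
    return {sum, std::chrono::duration<double>(stop - start).count()};
}
```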

psychocoderHPC · 2024-11-13T09:17:51Z

benchmarks/babelstream/src/babelStreamCommon.hpp


    // Block thread extent for DotKernel test work division parameters.
    [[maybe_unused]] constexpr auto blockThreadExtentMain = 1024;
+    [[maybe_unused]] constexpr auto dotGridBlockExtent = 256;


Feel free to add an extent for CPUs too, to support the CPU dot execution required for the verification.

Ok, now all backends are used and tested. [ AccCpuThreads backend is very slow, but passed the pipeline. ]
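
For reference, the grid extents printed in the results match array size divided by block thread extent (for example 262144 / 1024 = 256 on the GPU and 262144 / 1 = 262144 on AccCpuSerial), while the dot kernel on the GPU keeps the fixed extent of 256. A sketch of that computation, assuming a round-up division (the helper name is hypothetical and the rounding is an assumption, not taken from the PR):

```cpp
#include <cstddef>

// Number of blocks needed so that blocks * blockThreadExtent covers arraySize.
inline std::size_t gridBlockExtentFor(std::size_t arraySize, std::size_t blockThreadExtent)
{
    return (arraySize + blockThreadExtent - 1) / blockThreadExtent;
}
```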

chillenzer

Hi, thanks for your update. I've mostly checked for compliance with upstream babelstream as that was apparently a major point of discussion. But I want to first applaud you for that nice auto [i] = getIdx...; trick. Very nice indeed!

It's mostly small things I've found some of which we might actively decide to do differently, e.g., not measuring some timings of the infrastructure-ish calls. There was one section that GH didn't allow me to comment on, so I'll put that in here:
Technically speaking, the dot kernel does a slightly different thing by accumulating the threadSums in registers and only storing them once into shared memory. Our version vs. upstream version. Likely to be optimised away by the compiler because both versions leave full flexibility concerning memory ordering here but we can't be sure I believe.

Also, just for my information: Why is tbSum a reference in that same kernel? It very much looks like it must be dangling but if this compiles and runs correctly it apparently isn't?

chillenzer · 2024-11-13T08:56:59Z

benchmarks/babelstream/src/babelStreamMainTest.cpp

-        [&]()
-        { alpaka::exec<Acc>(queue, workDivTriad, TriadKernel(), bufAccInputAPtr, bufAccInputBPtr, bufAccOutputCPtr); },
-        "TriadKernel");
+    if(kernelsToBeExecuted == KernelsToRun::All || kernelsToBeExecuted == KernelsToRun::Triad)


Suggested change

if(kernelsToBeExecuted == KernelsToRun::All || kernelsToBeExecuted == KernelsToRun::Triad)

else if(kernelsToBeExecuted == KernelsToRun::Triad)

following https://github.com/UoB-HPC/BabelStream/blob/2f00dfb7f8b7cfe8c53d20d5c770bccbf8673440/src/main.cpp#L532

There is code repetition in the original code: both run_triad and run_all call the triad kernel. Here, I call the same piece of code for both cases.

alpaka/benchmarks/babelstream/src/main.cpp

Line 108 in 7a8b205

std::vector<std::vector<double>> run_all(Stream<T>* stream, T& sum)
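
The shared-condition approach described above can be sketched as follows. KernelsToRun::All and KernelsToRun::Triad appear in the diff; the NStream enumerator and the two helper functions are assumptions added for illustration only.

```cpp
// Selection flag; the NStream enumerator is a hypothetical addition.
enum class KernelsToRun
{
    All,
    Triad,
    NStream
};

// The triad kernel is executed both in the full run and in the triad-only run,
// so a single code path guarded by this condition serves both cases.
inline bool runsTriad(KernelsToRun k)
{
    return k == KernelsToRun::All || k == KernelsToRun::Triad;
}

// Copy, mul, add and dot are only executed in the full run.
inline bool runsCopyMulAddDot(KernelsToRun k)
{
    return k == KernelsToRun::All;
}
```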

chillenzer · 2024-11-18T12:35:07Z

benchmarks/babelstream/src/babelStreamMainTest.cpp

+    alpaka::exec<Acc>(
+        queue,
+        workDivInit,
+        InitKernel(),
+        bufAccInputAPtr,
+        bufAccInputBPtr,
+        bufAccOutputCPtr,
+        static_cast<DataType>(initA),
+        static_cast<DataType>(initB));


In the original, this call is already timed.

It is not one of the BabelStream kernels, but OK, I am implementing it.

Yes, I agree that it makes moderate sense at best but it might be interesting information and it brings us closer to upstream. Your announced change is not yet found in the PR.

ok, implemented. Thanks.

AcceleratorType:AccGpuCudaRt<1,unsigned int>
NumberOfRuns:2
Precision:double
DataSize(items):1048576
DeviceName:NVIDIA RTX A500 Laptop GPU
WorkDivInit :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivCopy :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivMult :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivAdd  :{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivTriad:{gridBlockExtent: (1024), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivDot  :{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
Kernels         Bandwidths(GB/s) MinTime(s) MaxTime(s) AvgTime(s) DataUsage(MB) 
 AddKernel       87.201          0.0002886 0.0002886 0.0002886 25.166 
 CopyKernel      79.096          0.00021211 0.00021211 0.00021211 16.777 
 DotKernel       74.74           0.00022447 0.00022447 0.00022447 16.777 
 InitKernel      87.865          0.00028641 0.00028641 0.00028641 25.166 
 MultKernel      85.107          0.00019713 0.00019713 0.00019713 16.777 
 TriadKernel     87.046          0.00028911 0.00028911 0.00028911 25.166 

chillenzer · 2024-11-18T13:05:19Z

benchmarks/babelstream/src/babelStreamMainTest.cpp

    alpaka::memcpy(queue, bufHostOutputC, bufAccOutputC, arraySize);
    alpaka::memcpy(queue, bufHostOutputB, bufAccInputB, arraySize);
    alpaka::memcpy(queue, bufHostOutputA, bufAccInputA, arraySize);


These get timed in the original version.

The read time (actually the copy time from Acc to Host for the 3 arrays) has been added to the output display as AccToHost Memcpy Time(sec). Thanks.

AcceleratorType:AccGpuCudaRt<1,unsigned int>
NumberOfRuns:100
Precision:double
DataSize(items):33554432
DeviceName:NVIDIA RTX A500 Laptop GPU
WorkDivInit :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivCopy :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivMult :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivAdd  :{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivTriad:{gridBlockExtent: (32768), blockThreadExtent: (1024), threadElemExtent: (1)}
WorkDivDot  :{gridBlockExtent: (256), blockThreadExtent: (1024), threadElemExtent: (1)}
AccToHost Memcpy Time(sec):0.570856
Kernels         Bandwidths(GB/s) MinTime(s) MaxTime(s) AvgTime(s) DataUsage(MB) 
 AddKernel       90.665          0.0088822 0.0089985 0.0089376 805.31 
 CopyKernel      89.087          0.0060264 0.0061119 0.0060773 536.87 
 DotKernel       93.055          0.0057694 0.0058486 0.0058113 536.87 
 InitKernel      84.437          0.0095374 0.0095374 0.0095374 805.31 
 MultKernel      89.35           0.0060086 0.0060852 0.0060568 536.87 
 TriadKernel     90.222          0.0089258 0.0090338 0.0089565 805.31 

mehmetyusufoglu · 2024-11-18T16:56:08Z

Also, just for my information: Why is tbSum a reference in that same kernel? It very much looks like it must be dangling but if this compiles and runs correctly it apparently isn't?

tbSum is a reference because the function's return type is -> T& and it returns a dereferenced pointer: return *data;
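
A plain C++ illustration of why such a reference does not dangle: the accessor returns a reference to storage that outlives the call, not to a local variable. This is only a host-side analogy with hypothetical names (sharedStorage, declareSharedValue); a static buffer stands in for block-shared memory.

```cpp
// Stand-in for block-shared storage: it outlives any single function call.
inline float* sharedStorage()
{
    static float storage[4] = {};
    return storage;
}

// Mimics an accessor declared with `-> T&` that ends with `return *data;`.
// The pointer is a local, but the object it points to is not.
inline float& declareSharedValue()
{
    float* data = sharedStorage();
    return *data;
}
```

Binding the result to a reference, as in float& tbSum = declareSharedValue();, therefore refers to the long-lived storage, and writes through tbSum are visible to later calls.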

mehmetyusufoglu · 2024-11-18T18:59:16Z

It's mostly small things I've found some of which we might actively decide to do differently, e.g., not measuring some timings of the infrastructure-ish calls. There was one section that GH didn't allow me to comment on, so I'll put that in here: Technically speaking, the dot kernel does a slightly different thing by accumulating the threadSums in registers and only storing them once into shared memory. Our version vs. upstream version. Likely to be optimised away by the compiler because both versions leave full flexibility concerning memory ordering here but we can't be sure I believe.

Yes, this was the choice in the first implementation in our repo. I have now switched to the same approach as the CUDA implementation and am checking the performance.

chillenzer · 2024-11-19T09:28:37Z

tbSum is a reference because the function's return type is -> T& and it returns a dereferenced pointer: return *data;

Thanks for the explanation! That makes sense.

Concerning the reduce implementation, I had an offline discussion with @psychocoderHPC: The concept to benchmark here is any implementation of a reduction based on alpaka. In that sense, we are not required to follow the reference implementation precisely. Not hammering on shared memory with every thread is probably a worthwhile change.

mehmetyusufoglu · 2024-11-19T18:05:31Z

Concerning the reduce implementation, I had an offline discussion with @psychocoderHPC: The concept to benchmark here is any implementation of a reduction based on alpaka. In that sense, we are not required to follow the reference implementation precisely. Not hammering on shared memory with every thread is probably a worthwhile change.

OK, I reverted it. (Yes, having every thread access shared memory many times is not needed in this case.)
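
The variant discussed here (accumulate per thread privately, store once to shared memory, then reduce) can be illustrated with a host-side simulation of a single block. This is a sketch under assumptions, not the alpaka kernel: blockDot is a hypothetical name, threads are simulated by a loop, a std::vector stands in for shared memory, and blockThreadExtent is assumed to be a power of two.

```cpp
#include <cstddef>
#include <vector>

// Simulates one block of a dot kernel on the host.
inline double blockDot(
    std::vector<double> const& a,
    std::vector<double> const& b,
    std::size_t blockThreadExtent) // assumed to be a power of two
{
    std::vector<double> tbSum(blockThreadExtent, 0.0); // stand-in for shared memory
    for(std::size_t thread = 0; thread < blockThreadExtent; ++thread)
    {
        double threadSum = 0.0; // would live in a register on a GPU
        for(std::size_t i = thread; i < a.size(); i += blockThreadExtent)
            threadSum += a[i] * b[i];
        tbSum[thread] = threadSum; // single store to shared memory per thread
    }
    // Tree reduction over the shared array.
    for(std::size_t offset = blockThreadExtent / 2; offset > 0; offset /= 2)
        for(std::size_t thread = 0; thread < offset; ++thread)
            tbSum[thread] += tbSum[thread + offset];
    return tbSum[0];
}
```

The upstream-style alternative would add into tbSum[thread] inside the strided loop, which touches shared memory once per element instead of once per thread.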

mehmetyusufoglu · 2024-11-25T09:59:38Z

Update as of 25th Nov: the benchmark is now run on all backends and the results are verified for all of them.

[This was previously avoided because of CI runtime, but it now takes less time than the longest-running example, which is heat-equation.]

mehmetyusufoglu force-pushed the updateBabelStr branch from 3777a36 to 98aac95 Compare November 9, 2024 15:03

mehmetyusufoglu marked this pull request as draft November 10, 2024 20:03

mehmetyusufoglu force-pushed the updateBabelStr branch 8 times, most recently from 07496fa to d7982fb Compare November 11, 2024 14:08

mehmetyusufoglu force-pushed the updateBabelStr branch 2 times, most recently from 2bfcf11 to 3aa4303 Compare November 12, 2024 13:40

mehmetyusufoglu marked this pull request as ready for review November 12, 2024 13:42

mehmetyusufoglu force-pushed the updateBabelStr branch from 3aa4303 to 862a390 Compare November 12, 2024 13:53

psychocoderHPC requested changes Nov 13, 2024

psychocoderHPC reviewed Nov 13, 2024

mehmetyusufoglu marked this pull request as draft November 13, 2024 15:35

mehmetyusufoglu force-pushed the updateBabelStr branch 6 times, most recently from cf19b90 to e2b8eae Compare November 18, 2024 13:28

psychocoderHPC changed the title ~~Make kernel results depend each other directly~~ fix bablestream benchmark Nov 18, 2024

psychocoderHPC added the Type:Bug label Nov 18, 2024

psychocoderHPC added this to the 2.0.0 milestone Nov 18, 2024

chillenzer suggested changes Nov 18, 2024

mehmetyusufoglu force-pushed the updateBabelStr branch from e2b8eae to 254dc3d Compare November 18, 2024 15:46

mehmetyusufoglu force-pushed the updateBabelStr branch 2 times, most recently from e014dca to efbd007 Compare November 18, 2024 18:56

mehmetyusufoglu force-pushed the updateBabelStr branch 4 times, most recently from 7ec9872 to dd2b920 Compare November 19, 2024 18:02

mehmetyusufoglu force-pushed the updateBabelStr branch 7 times, most recently from 30cb8e8 to ce889b1 Compare November 24, 2024 23:02

make kernels depend each other, use original variables and access order

b1cb527

mehmetyusufoglu force-pushed the updateBabelStr branch from ce889b1 to b1cb527 Compare November 25, 2024 09:47

mehmetyusufoglu marked this pull request as ready for review November 25, 2024 09:51
