[AMD] mir-glas is slower than OpenBLAS for DGEMM #20
I have LLVM version 3.9.1 installed.
Hey @MigMuc,

No, only the cblas_dgemm CBLAS function is called. I have never tested GLAS on AMD CPUs. It would be awesome to have benchmarks for AMD; they can be posted in the blog https://github.com/libmir/blog. Is the AMD FX(TM)-4300 @ 3.8 GHz your CPU? Possible factors that may influence performance:

Let's start with the computation kernels to optimize GLAS. OpenBLAS uses sgemm_kernel_16x2_piledriver. This is strange, because that kernel does not use YMM registers, only XMM registers. Maybe on Piledriver YMM operations are simulated on top of XMM? To see the GLAS DGEMM kernel, compile this gist with

Thanks!
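As a general way to inspect the generated kernel: the probe file below is only a sketch, assuming the ndslice-based gemm(alpha, a, b, beta, c) entry point shown in the mir-glas README; build it through dub or with the proper -I paths, and pass LDC's -output-s flag to emit the assembly so you can check which registers the DGEMM microkernel uses.

```d
// kernel_probe.d -- a sketch only; the gemm signature is assumed from the
// mir-glas README (gemm(alpha, a, b, beta, c) on 2D ndslices).
// Emit assembly with something like:
//   ldc2 -O -release -mcpu=native -output-s kernel_probe.d   (plus -I paths)
import mir.ndslice;
import glas.ndslice;

void main()
{
    auto a = slice!double(256, 256);
    auto b = slice!double(256, 256);
    auto c = slice!double(256, 256);
    gemm(1.0, a, b, 0.0, c); // the DGEMM kernel to look for in the .s output
}
```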
Hi @9il,
Yes, it has a Piledriver core.
with
I got the following result:
Please replace float with double for
Can one use mir-glas on Windows for C/C++ projects using Visual Studio?
@RoyiAvital, yes. It has C headers. Note that it is single-threaded for now.
@9il, is there a guide or are there examples of how to use it from C code under Windows? Thank you.
See also
@MigMuc, could you please add labels to the axes? Thank you.
As you can see, the performance varies quite a bit; especially AMD's own ACML is really weak on single-precision complex performance, where GLAS is the best. But there are two cases where GLAS could be substantially improved, i.e. the single- and double-precision cases. Regarding the implementation of gemm in GLAS, as far as I can see there are a few lines in
auto re = s[0] * reg[n][0][m];
Is this the 1m implementation from BLIS for complex arithmetic?
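For context, a conventional complex kernel expands each complex multiply-accumulate into four real multiplies; below is a minimal sketch of that identity, independent of GLAS's actual register layout (the s/reg names above are not reused).

```d
// (ar + ai*i) * (br + bi*i) accumulated into (cr + ci*i), written with real
// arithmetic only -- the operation any complex GEMM microkernel has to
// express in terms of real multiply-adds.
void complexMulAcc(double ar, double ai, double br, double bi,
                   ref double cr, ref double ci)
{
    cr += ar * br - ai * bi; // real part
    ci += ar * bi + ai * br; // imaginary part
}
```

BLIS's 1m method instead gets the same result out of a real-only microkernel by packing the operands in a special layout.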
Any chance of having Intel MKL there as well? Thank you.
This is an AMD CPU, so I guess Intel MKL would not be optimized for this case. It would probably work on this machine, but I don't have MKL installed.
@RoyiAvital: BTW, do you have any benchmarks you could provide? It would be great to have some comparisons with Intel CPUs as well.
I have done some Intel MKL vs. OpenBLAS comparisons using MATLAB and Julia. Have a look at Benchmark MATLAB & Julia for Matrix Operations. But now I'm mostly interested in small-matrix performance (up to ~1000 elements).
Some time ago I did some benchmark testing with gemm. I would like to debug gemm_example.d in the examples folder in order to find the blocking sizes for this particular CPU as calculated by the mir-cpuid package, and compare them with the blocking sizes of OpenBLAS and BLIS. Therefore I changed the build type from --build=target-native to --build=debug in the dub.json file. But then I get linker errors:
The GLAS build system was created with the assumption that it always builds in release mode. Half of the files are simply never compiled, because all functions are marked as always inlined. I recommend using C's printf to find the required information, or fixing the build configuration so that the required files are compiled and linked.
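A minimal sketch of that printf route, with mc/nc/kc as placeholder names for whatever variables actually hold the blocking sizes at the point you instrument; it avoids the debug build entirely.

```d
import core.stdc.stdio : printf;

// Print the blocking parameters without switching dub to a debug build;
// plain printf works fine from release-mode D code.
void reportBlocking(size_t mc, size_t nc, size_t kc)
{
    printf("blocking: mc=%zu nc=%zu kc=%zu\n", mc, nc, kc);
}
```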
I successfully compiled the gemm_report.d benchmark provided by mir-glas. I ran it twice: once comparing with OpenBLAS and once comparing against ACML-5.3.1. As you can see from the benchmarks, mir-glas does not yield full performance for large matrices. Peak performance for my machine is about 23 GFLOPs for double precision. But ACML does not achieve full performance here either. So I decided to compare with the dgemm.goto and dgemm.acml benchmark programs provided in OpenBLAS/benchmark. There ACML reaches peak performance too. Is there any overhead calling ACML from D?
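For reference on how such GFLOPS figures relate to the problem size (a generic formula, not anything specific to gemm_report's internals): an M-by-K times K-by-N product performs roughly 2*M*N*K floating-point operations, so the achieved rate is

```d
/// GFLOP/s achieved by an M-by-K times K-by-N matrix product that took
/// `seconds` of wall time, counting roughly 2*M*N*K floating-point operations.
double gflops(size_t m, size_t n, size_t k, double seconds)
{
    return 2.0 * m * n * k / (seconds * 1e9);
}
```

Because any fixed per-call overhead is divided by that 2*M*N*K term, overhead from calling ACML through D should fade for large matrices; a gap that persists at large sizes more likely points at blocking or kernel differences.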