[AMD] mir-glas is slower than OpenBLAS for DGEMM #20
I have LLVM version 3.9.1 installed.
Hey @MigMuc,

No, only the cblas_dgemm CBLAS function is called. I have never tested GLAS on AMD CPUs. It would be awesome to have benchmarks for AMD; they can be posted in the blog https://github.com/libmir/blog. Is the AMD FX(TM)-4300 @ 3.8 GHz your CPU? Possible factors that may influence performance:

Let's start with the computation kernels to optimize GLAS. OpenBLAS uses sgemm_kernel_16x2_piledriver. This is strange, because that kernel does not use YMM registers, only XMM registers. Maybe on Piledriver YMM operations are simulated on top of XMM? To see the GLAS DGEMM kernel, compile this gist with

Thanks!
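As a general way to inspect the generated kernel: the probe file below is only a sketch, assuming the ndslice-based gemm(alpha, a, b, beta, c) entry point shown in the mir-glas README; build it through dub or with the proper -I paths, and pass LDC's -output-s flag to emit the assembly so you can check which registers the DGEMM microkernel uses.

```d
// kernel_probe.d -- a sketch only; the gemm signature is assumed from the
// mir-glas README (gemm(alpha, a, b, beta, c) on 2D ndslices).
// Emit assembly with something like:
//   ldc2 -O -release -mcpu=native -output-s kernel_probe.d   (plus -I paths)
import mir.ndslice;
import glas.ndslice;

void main()
{
    auto a = slice!double(256, 256);
    auto b = slice!double(256, 256);
    auto c = slice!double(256, 256);
    gemm(1.0, a, b, 0.0, c); // the DGEMM kernel to look for in the .s output
}
```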
Hi @9il,
Yes, it has a Piledriver core.
with
I got the following result:
Please replace float with double for
Can one use mir-glas on Windows for C/C++ projects using Visual Studio?
@RoyiAvital, yes. It has C headers. Note that it is single-threaded for now.
@9il, is there a guide or are there examples of how to use it from C code under Windows? Thank you.
See also
@MigMuc, could you please add labels to the axes? Thank you.
As you can see, the performance varies quite a bit; especially AMD's own ACML is really weak on single-precision complex performance, where GLAS is the best. But there are two cases where GLAS could be substantially improved, i.e. the single- and double-precision cases. Regarding the implementation of gemm in GLAS, as far as I can see there are a few lines in
auto re = s[0] * reg[n][0][m];
Is this the 1m implementation from BLIS for complex arithmetic?
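For context, a conventional complex kernel expands each complex multiply-accumulate into four real multiplies; below is a minimal sketch of that identity, independent of GLAS's actual register layout (the s/reg names above are not reused).

```d
// (ar + ai*i) * (br + bi*i) accumulated into (cr + ci*i), written with real
// arithmetic only -- the operation any complex GEMM microkernel has to
// express in terms of real multiply-adds.
void complexMulAcc(double ar, double ai, double br, double bi,
                   ref double cr, ref double ci)
{
    cr += ar * br - ai * bi; // real part
    ci += ar * bi + ai * br; // imaginary part
}
```

BLIS's 1m method instead gets the same result out of a real-only microkernel by packing the operands in a special layout.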
Any chance of having Intel MKL there as well? Thank you.
This is an AMD CPU, so I guess Intel MKL would not be optimized for this case. It would probably work on this machine, but I don't have MKL installed.
@RoyiAvital: BTW, do you have any benchmarks you could provide? It would be great to have some comparisons with Intel CPUs as well.
I have done some Intel MKL vs. OpenBLAS comparisons using MATLAB and Julia. Have a look at Benchmark MATLAB & Julia for Matrix Operations. But now I'm mostly interested in small-matrix performance (up to ~1000 elements).
Some time ago I did some benchmark testing with gemm. I would like to debug gemm_example.d in the examples folder in order to find the blocking sizes for this particular CPU as calculated by the mir-cpuid package, and compare them with the blocking sizes of OpenBLAS and BLIS. Therefore I changed the build type from --build=target-native to --build=debug in the dub.json file. But then I get linker errors:
The GLAS build system was created with the assumption that it always builds in release mode. Half of the files are simply never compiled, because all functions are marked as always inlined. I recommend using C's printf to find the required information, or fixing the build configuration so that the required files are compiled and linked.
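A minimal sketch of that printf route, with mc/nc/kc as placeholder names for whatever variables actually hold the blocking sizes at the point you instrument; it avoids the debug build entirely.

```d
import core.stdc.stdio : printf;

// Print the blocking parameters without switching dub to a debug build;
// plain printf works fine from release-mode D code.
void reportBlocking(size_t mc, size_t nc, size_t kc)
{
    printf("blocking: mc=%zu nc=%zu kc=%zu\n", mc, nc, kc);
}
```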
I successfully compiled the gemm_report.d benchmark provided by mir-glas. I ran it twice: once comparing with OpenBLAS and once comparing against ACML-5.3.1. As you can see from the benchmarks, mir-glas does not yield full performance for large matrices. Peak performance for my machine is about 23 GFLOPs for double precision. But ACML does not achieve full performance here either. So I decided to compare with the dgemm.goto and dgemm.acml benchmark programs provided in OpenBLAS/benchmark. There ACML reaches peak performance too. Is there any overhead calling ACML from D?
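For reference on how such GFLOPS figures relate to the problem size (a generic formula, not anything specific to gemm_report's internals): an M-by-K times K-by-N product performs roughly 2*M*N*K floating-point operations, so the achieved rate is

```d
/// GFLOP/s achieved by an M-by-K times K-by-N matrix product that took
/// `seconds` of wall time, counting roughly 2*M*N*K floating-point operations.
double gflops(size_t m, size_t n, size_t k, double seconds)
{
    return 2.0 * m * n * k / (seconds * 1e9);
}
```

Because any fixed per-call overhead is divided by that 2*M*N*K term, overhead from calling ACML through D should fade for large matrices; a gap that persists at large sizes more likely points at blocking or kernel differences.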