This will be the last report for 2024.
Make requested revisions to Paolo's upstream patches.
- IN PROGRESS.
- This work is now being handled by Craig Blackmore.
- Following a question from Max Chou we have carried out detailed statistical analysis to verify that the new optimization is always beneficial, and have slightly adjusted the thresholds to tune particularly for memcpy. This is now posted upstream (see this mailing list post). See the detailed discussion below.
- The second patch was posted a week ago requesting comments, but none have yet been received. The patch has been reposted to allow it to be reviewed for merge (see this mailing list post).
SiFive benchmarks.
- COMPLETE.
This is a patch which only benefits small loads and stores - it is disabled for larger ones. As such it complements Max Chou's earlier patch, which benefits larger loads and stores. Max Chou raised a question on the revised first patch about whether it could slow down memcpy for larger data sizes (see this mailing list post). We carried out extensive benchmarking and statistical analysis on a number of machines. Three points became apparent:
- the cut-off data size below which this optimization is used varies (in the range 6-10) depending on VLEN and the specific host platform being used;
- the speedup gained varies (in the range 30-60%) depending on the host platform being used; and
- there is no statistically significant impact on performance for larger data sizes.
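The small-versus-large dispatch described above can be sketched as follows. This is a minimal illustrative model, not the patch's actual code: the cut-off value, function names, and the byte-wise fast path are all assumptions standing in for the real QEMU implementation, whose threshold depends on VLEN and host platform.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cut-off, in the 6-10 range seen in benchmarking;
 * the real value is tuned per VLEN and host platform. */
#define SMALL_COPY_CUTOFF 8

/* Stand-in for the new optimized small-access path. */
static void copy_small_fast(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}

/* Small accesses take the new fast path; larger accesses fall back to
 * the existing implementation, so large copies are unaffected. */
static void copy_dispatch(uint8_t *dst, const uint8_t *src, size_t n)
{
    if (n <= SMALL_COPY_CUTOFF) {
        copy_small_fast(dst, src, n);
    } else {
        memcpy(dst, src, n);
    }
}
```

Because the optimization is simply disabled above the cut-off, any regression for large data sizes would have to come from the dispatch test itself, which is why the statistical result above (no significant impact for larger sizes) is the expected outcome.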
The comparison used:
- Baseline commit: 248f9209ed
- Patch commit: 594c0cb1ab
The following graphs show the results of 6 full performance runs for memcpy using single-threaded execution. The host was an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.
The variation with change of processor can be seen from the following graphs. These show 15 short performance runs for memcpy using single-threaded execution. The host was an AMD Ryzen 7 7840HS.
Generate TCG Ops for vector whole word load/store.
- IN PROGRESS.
- We have updated the patch from last week to ensure vstart is correctly handled. The patch can be found in this mailing list post. The patch is particularly beneficial for large copies with small VLEN, yielding up to 10x speedup.
Improve first-fault handling for vector load/store helper functions.
- IN PROGRESS.
- No new work to report this week.
Improve strided load/store helper functions.
- IN PROGRESS.
- No new work to report this week.
There is no new general benchmark run this week. However, the statistical analysis above reports detailed results for memcpy using the latest version of the patch.
This uses a reimplementation of the memcpy benchmark using whole word load/store. The source is in this GitHub fork of sifive-libc. Since this is the only benchmark which will benefit from this optimization, we run the benchmark just for memcpy.
- Baseline commit: 8032c78e55
- Patch commit: 662f602a62
- See report-2024-12-19-13-42-32.pdf.
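As a rough model of why whole-register load/store helps, a memcpy built on it copies full register-width chunks and finishes with a scalar tail. The C sketch below is illustrative only: VLENB, the function name, and the use of memmove to stand in for vl1r.v/vs1r.v are assumptions, not the sifive-libc code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative vector register width in bytes; real VLENB depends
 * on the configured VLEN. */
#define VLENB 16

static void memcpy_whole_reg(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    /* Copy full register-width chunks, as whole-register
     * load/store instructions would. */
    for (; i + VLENB <= n; i += VLENB) {
        memmove(dst + i, src + i, VLENB);
    }
    /* Remaining tail handled element-wise. */
    for (; i < n; i++) {
        dst[i] = src[i];
    }
}
```

With a small VLENB the chunk loop iterates many times per copy, which is where avoiding a helper-function call per chunk pays off most; this matches the up-to-10x speedup reported above for small VLEN.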
Not surprisingly we see particular benefit for copies that will not fit in a single vector register. The speedup (up to 10x) is most pronounced for small vectors. This reflects the current helper function implementation being most effective for large vector registers.
Jeremy Bennett, Craig Blackmore and Paolo Savini will be on vacation 23 December to 3 January.
Next meeting 8 January 2025.