This will be the last report for 2024.
Make requested revisions to Paolo's upstream patches.
- IN PROGRESS.
- This work is now being handled by Craig Blackmore.
- Following a question from Max Chou we have carried out detailed statistical analysis to verify that the new optimization is always beneficial, and have slightly adjusted the thresholds to tune particularly for memcpy. This is now posted upstream (see this mailing list post). See the detailed discussion below.
- The second patch was posted a week ago requesting comments, but none have yet been received. The patch has been reposted to allow it to be reviewed for merge (see this mailing list post).
SiFive benchmarks.
- COMPLETE.
This is a patch which only benefits small loads and stores - it is disabled for larger ones. As such it complements Max Chou's earlier patch, which benefits larger loads and stores. Max Chou raised a question on the revised first patch about whether it could slow down memcpy for larger data sizes (see this mailing list post). We carried out extensive benchmarking and statistical analysis on a number of machines. Three points became apparent:
- the cut-off data size below which this optimization is used varies (in the range 6-10) depending on VLEN and the specific host platform being used;
- the speedup gained varies (in the range 30-60%) depending on the host platform being used; and
- there is no statistically significant impact on performance for larger data sizes.
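The small-versus-large dispatch described above can be sketched as follows. This is a minimal illustrative model, not the patch's actual code: the cut-off value, function names, and the byte-wise fast path are all assumptions standing in for the real QEMU implementation, whose threshold depends on VLEN and host platform.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cut-off, in the 6-10 range seen in benchmarking;
 * the real value is tuned per VLEN and host platform. */
#define SMALL_COPY_CUTOFF 8

/* Stand-in for the new optimized small-access path. */
static void copy_small_fast(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}

/* Small accesses take the new fast path; larger accesses fall back to
 * the existing implementation, so large copies are unaffected. */
static void copy_dispatch(uint8_t *dst, const uint8_t *src, size_t n)
{
    if (n <= SMALL_COPY_CUTOFF) {
        copy_small_fast(dst, src, n);
    } else {
        memcpy(dst, src, n);
    }
}
```

Because the optimization is simply disabled above the cut-off, any regression for large data sizes would have to come from the dispatch test itself, which is why the statistical result above (no significant impact for larger sizes) is the expected outcome.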
The comparison used:
- Baseline commit: 248f9209ed
- Patch commit: 594c0cb1ab
The following graphs show the results of 6 full performance runs for memcpy using single-threaded execution. The host was an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.
The variation with change of processor can be seen from the following graphs. These show 15 short performance runs for memcpy using single-threaded execution. The host was an AMD Ryzen 7 7840HS.
Generate TCG Ops for vector whole word load/store.
- IN PROGRESS.
- We have updated the patch from last week to ensure vstart is correctly handled. The patch can be found in this mailing list post. The patch is particularly beneficial for large copies with small VLEN, yielding up to 10x speedup.
Improve first-fault handling for vector load/store helper functions.
- IN PROGRESS.
- No new work to report this week.
Improve strided load/store helper functions.
- IN PROGRESS.
- No new work to report this week.
There is no new general benchmark run this week. However, the statistical analysis above reports detailed results for memcpy using the latest version of the patch.
This uses a reimplementation of the memcpy benchmark using whole word load/store. The source is in this GitHub fork of sifive-libc. Since this is the only benchmark which will benefit from this optimization, we run the benchmark just for memcpy.
- Baseline commit: 8032c78e55
- Patch commit: 662f602a62
- See report-2024-12-19-13-42-32.pdf.
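As a rough model of why whole-register load/store helps, a memcpy built on it copies full register-width chunks and finishes with a scalar tail. The C sketch below is illustrative only: VLENB, the function name, and the use of memmove to stand in for vl1r.v/vs1r.v are assumptions, not the sifive-libc code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative vector register width in bytes; real VLENB depends
 * on the configured VLEN. */
#define VLENB 16

static void memcpy_whole_reg(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    /* Copy full register-width chunks, as whole-register
     * load/store instructions would. */
    for (; i + VLENB <= n; i += VLENB) {
        memmove(dst + i, src + i, VLENB);
    }
    /* Remaining tail handled element-wise. */
    for (; i < n; i++) {
        dst[i] = src[i];
    }
}
```

With a small VLENB the chunk loop iterates many times per copy, which is where avoiding a helper-function call per chunk pays off most; this matches the up-to-10x speedup reported above for small VLEN.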
Not surprisingly we see particular benefit for copies that will not fit in a single vector register. The speedup (up to 10x) is most pronounced for small vectors. This reflects the current helper function implementation being most effective for large vector registers.
Jeremy Bennett, Craig Blackmore and Paolo Savini will be on vacation 23 December to 3 January.
Next meeting 8 January 2025.