RISE RP005 QEMU weekly report 2024-12-18

This will be the last report for 2024.

Milestone 2

Make requested revisions to Paolo's upstream patches.

  • IN PROGRESS.
  • This work is now being handled by Craig Blackmore.
  • Following a question from Max Chou, we have carried out a detailed statistical analysis to verify that the new optimization is always beneficial, and have slightly adjusted the thresholds to tune particularly for memcpy. This is now posted upstream (see this mailing list post). See the detailed discussion below.
  • The second patch was posted a week ago requesting comments, but none have yet been received. It has now been reposted so that it can be reviewed for merging (see this mailing list post).

SiFive benchmarks.

  • COMPLETE.

Patch 1 details

This patch only benefits small loads and stores; it is disabled for larger loads and stores. As such it complements Max Chou's earlier patch, which benefits larger loads and stores. Max Chou raised a question on the revised first patch about whether it could slow down memcpy for larger data sizes (see this mailing list post). We carried out extensive benchmarking and statistical analysis on a number of machines. Three points became apparent (a sketch of the size-threshold dispatch follows the list below):

  • the data-size cut-off point for using this optimization varies (in the range 6-10) depending on VLEN and the specific host platform being used;
  • the speedup gained varies (in the range 30-60%) depending on the host platform being used; and
  • there is no statistically significant impact on performance for larger data sizes.
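
As a purely illustrative sketch of the idea, the following shows a size-threshold dispatch of the kind involved: small accesses take an inlined fast path, while anything above the cut-off falls back to the existing helper. The names and the threshold value are assumptions made for this example, not the actual QEMU code.

```c
/*
 * Illustrative sketch only: the kind of size-threshold dispatch the patch
 * uses. All names and the threshold value are chosen for this example and
 * are not taken from the actual QEMU code.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical cut-off: accesses of this many bytes or fewer take the
 * inlined fast path. The analysis above suggests the best value lies in
 * the range 6-10, depending on VLEN and host.
 */
#define SMALL_ACCESS_THRESHOLD 8

/* Stand-in for the existing, unchanged helper used for larger accesses. */
static void vector_memory_helper(uint8_t *dst, const uint8_t *src, size_t len)
{
    memcpy(dst, src, len);
}

void vector_memory_access(uint8_t *dst, const uint8_t *src, size_t len)
{
    if (len <= SMALL_ACCESS_THRESHOLD) {
        /* Fast path: small accesses are handled inline. */
        memcpy(dst, src, len);
    } else {
        /* Larger accesses fall back to the helper unchanged, so they
         * cannot be made slower by the optimization. */
        vector_memory_helper(dst, src, len);
    }
}
```

Keeping the fast path strictly below the cut-off is what guarantees that larger loads and stores, including those already handled by Max Chou's earlier patch, cannot regress.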


The following graphs show the results of carrying out 6 full performance runs of memcpy using single-threaded execution. The host was an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.

The variation across host processors can be seen in the following graphs, which show 15 short performance runs of memcpy using single-threaded execution. The host was an AMD Ryzen 7 7840HS.
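
For reference, the sketch below shows the kind of check that can be used to decide whether two sets of run times differ significantly: it computes Welch's t statistic for a baseline sample and a patched sample. The sample values and the informal threshold on |t| are placeholders, not the figures from the runs above.

```c
/*
 * Minimal sketch of a significance check on two sets of run times using
 * Welch's t statistic. The sample values and the informal threshold on |t|
 * are placeholders, not the project's measured data.
 */
#include <math.h>
#include <stdio.h>

static void mean_var(const double *x, int n, double *mean, double *var)
{
    double sum = 0.0, sq = 0.0;
    for (int i = 0; i < n; i++) {
        sum += x[i];
    }
    *mean = sum / n;
    for (int i = 0; i < n; i++) {
        double d = x[i] - *mean;
        sq += d * d;
    }
    *var = sq / (n - 1); /* sample variance */
}

int main(void)
{
    /* Placeholder run times (seconds) for baseline and patched QEMU. */
    const double base[]  = { 1.02, 1.01, 1.03, 1.00, 1.02, 1.01 };
    const double patch[] = { 1.01, 1.02, 1.00, 1.03, 1.01, 1.02 };
    const int nb = 6, np = 6;

    double mb, vb, mp, vp;
    mean_var(base, nb, &mb, &vb);
    mean_var(patch, np, &mp, &vp);

    /* Welch's t: difference of means over the combined standard error. */
    double t = (mb - mp) / sqrt(vb / nb + vp / np);
    printf("t = %.3f\n", t);

    /* For samples of this size, |t| well below about 2 indicates no
     * statistically significant difference between the two sets. */
    return 0;
}
```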

Milestone 3

Generate TCG Ops for vector whole word load/store.

  • IN PROGRESS.
  • We have updated the patch from last week to ensure vstart is correctly handled. The patch can be found in this mailing list post. It is particularly beneficial for large copies with small VLEN, yielding up to a 10x speedup (a conceptual sketch of the vstart behaviour follows).
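
As a conceptual illustration of the behaviour the patch has to preserve, the C model below shows a whole register byte load that skips elements before vstart and clears vstart on completion. This is not the TCG implementation itself; the structure, names and example VLEN are assumptions made for the sketch.

```c
/*
 * Conceptual C model, not the TCG implementation, of a whole register byte
 * load that respects vstart. The structure and names are illustrative; a
 * VLEN of 128 bits is assumed purely for the example.
 */
#include <stdint.h>
#include <string.h>

#define VLENB 16 /* example VLEN of 128 bits; real VLEN varies by core */

typedef struct {
    uint8_t vreg[VLENB]; /* one vector register viewed as byte elements */
    unsigned vstart;     /* element index at which execution resumes */
} vstate;

/*
 * Whole register byte load in the spirit of vl1re8.v: elements below
 * vstart are left untouched, elements from vstart upwards are loaded,
 * and vstart is cleared on completion.
 */
void whole_reg_load_e8(vstate *s, const uint8_t *mem)
{
    if (s->vstart < VLENB) {
        /* A trapped-and-resumed access must not rewrite earlier elements. */
        memcpy(&s->vreg[s->vstart], &mem[s->vstart], VLENB - s->vstart);
    }
    /* Successful completion resets vstart to zero. */
    s->vstart = 0;
}
```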

Improve first-fault handling for vector load/store helper functions.

  • IN PROGRESS.
  • No new work to report this week.

Improve strided load/store helper functions.

  • IN PROGRESS.
  • No new work to report this week.

Statistics

SiFive benchmarks: Patch 1 comparison

There is no new general benchmark run this week. However, the statistical analysis above reports detailed results for memcpy with the latest version of the patch.

SiFive benchmarks: TCGOp generation for memcpy (modified)

This uses a reimplementation of the memcpy benchmark based on whole word load/store. The source is in this GitHub fork of sifive-libc. Since this is the only benchmark which will benefit from this optimization, we run it for memcpy alone.

Unsurprisingly, we see particular benefit for copies that will not fit in a single vector register. The speedup (up to 10x) is most pronounced for small VLEN, reflecting the fact that the current helper function implementation is most effective for large vector registers.
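
For reference, the following is a rough sketch of what a copy loop built on whole register loads and stores can look like. It is not taken from the sifive-libc fork above; the function name is ours, and it assumes a RISC-V toolchain and target with the V extension enabled and inline-asm support for vector register clobbers.

```c
/*
 * Rough sketch, not the benchmark's actual code: a memcpy built on RVV
 * whole register loads/stores. Eight-register blocks are moved with
 * vl8re8.v/vs8r.v and any tail is copied byte by byte. Assumes a RISC-V
 * toolchain and target with the V extension enabled (e.g. -march=rv64gcv).
 */
#include <stddef.h>
#include <stdint.h>

void *memcpy_whole_reg(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
    size_t vlenb;

    /* The vlenb CSR gives VLEN in bytes; eight whole registers hold
     * 8 * vlenb bytes. */
    __asm__ volatile ("csrr %0, vlenb" : "=r" (vlenb));

    while (n >= 8 * vlenb) {
        __asm__ volatile (
            "vl8re8.v v0, (%0)\n\t"
            "vs8r.v   v0, (%1)"
            : /* no outputs */
            : "r" (s), "r" (d)
            : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "memory");
        s += 8 * vlenb;
        d += 8 * vlenb;
        n -= 8 * vlenb;
    }

    /* Tail: anything smaller than a whole eight-register group. */
    while (n--) {
        *d++ = *s++;
    }
    return dst;
}
```

Because each iteration moves 8 * VLENB bytes regardless of the configured element width, the bulk of the copy needs no vsetvli stripmining.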

Actions

Other

Jeremy Bennett, Craig Blackmore and Paolo Savini will be on vacation 23 December to 3 January.

Next meeting 8 January 2025.