Micro-benchmarks demonstrating low-level concepts that affect performance.

MC-DeltaT/cpu-performance-demos

CPU Performance Demonstrations

A collection of microbenchmarks demonstrating low-level concepts and optimisations that affect performance on modern x86 CPUs.
Each demonstration examines a single concept in isolation to make learning easier.

Warning: CPU microarchitecture ahead!

How To Start

If you are new to CPU architecture and microarchitecture, you may want to read Primer.md first. It covers basic concepts that many of the demonstrations assume as prerequisite knowledge.

Some demonstrations inherently build on topics discussed in other demonstrations. You may want to try the demonstrations in this order to minimise confusion:

  1. Superscalar execution
  2. Out-of-order execution
  3. Branch prediction
  4. Indirect jump prediction
  5. Loop-carried dependencies
  6. Register renaming
  7. MOV elimination
  8. Zeroing idioms
  9. Macro-op fusion
  10. Cache efficiency

Enjoy!

Notes

Performance disclaimer

The exact results of a microbenchmark depend heavily on your CPU's microarchitecture, and demonstrating microarchitecture in a microarchitecture-agnostic manner is difficult. Factors that may cause your results to differ include:

  • The feature being demonstrated is not implemented on every CPU.
  • The demonstration assumes particular instruction latencies.
  • The demonstration assumes a minimum amount of parallel execution capacity (execution ports).

The demonstrations were written and tested with Intel x86-64 CPUs from Skylake onwards in mind. I have tried my best to indicate in each demonstration broadly which CPUs are supported and what assumptions are made.
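To make the latency and execution-port assumptions more concrete, here is a minimal C sketch. It is not taken from this repository, and the function names `sum_one_chain` and `sum_four_chains` are hypothetical. Both functions compute the same sum, but the first forms a single dependency chain while the second keeps four independent chains that a sufficiently wide CPU may be able to execute in parallel. How large the difference is, or whether there is one at all, depends on the CPU and on compiler optimisations such as unrolling and vectorisation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each add needs the result of the previous add, so the
   loop forms a single dependency chain. */
uint64_t sum_one_chain(const uint64_t *data, size_t n) {
    assert(data != NULL || n == 0);
    uint64_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += data[i];
    }
    return acc;
}

/* Four accumulators: the four adds in each iteration do not depend on each
   other, so a CPU with enough execution ports may overlap them. */
uint64_t sum_four_chains(const uint64_t *data, size_t n) {
    assert(data != NULL || n == 0);
    uint64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += data[i];
        a1 += data[i + 1];
        a2 += data[i + 2];
        a3 += data[i + 3];
    }
    /* Handle the 0 to 3 elements left over after the unrolled loop. */
    for (; i < n; ++i) {
        a0 += data[i];
    }
    /* Unsigned addition wraps, so regrouping the adds gives the same sum. */
    return a0 + a1 + a2 + a3;
}
```

Timing the two functions on a large array is one way to see whether your CPU and compiler match the assumptions a demonstration makes. Inspect the generated assembly before drawing conclusions, since the compiler may transform either loop.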

What's this "Skylake JCC alignment issue"?

In almost every demonstration's assembly code, you will see something like this:

.p2align 4      # Skylake JCC alignment issue (unimportant)
loop:
    ...

The .p2align 4 directive aligns the start of the loop to a 16-byte (2^4) boundary. This alignment places the loop's trailing jump instruction so that it avoids a performance pessimisation on some Intel CPUs (see Intel's paper for details).
You can safely ignore this directive; it does not affect the correctness of the demonstrations.
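For readers curious about the arithmetic, the following C sketch shows what a power-of-two alignment does to an address and how one might check a jump's placement. The helper names `p2align_up` and `touches_32_byte_boundary` are hypothetical and not part of this repository. The 32-byte condition reflects my reading of Intel's description of the issue (a jump that crosses a 32-byte boundary or ends on one), so treat it as an assumption and consult Intel's paper for the authoritative rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round addr up to the next multiple of 2^log2_align. This mirrors the
   effect of `.p2align log2_align` on the assembler's location counter. */
uint64_t p2align_up(uint64_t addr, unsigned log2_align) {
    assert(log2_align < 64);
    uint64_t mask = (UINT64_C(1) << log2_align) - 1;
    return (addr + mask) & ~mask;
}

/* Assumed condition: an instruction occupying [start, start + len) is
   affected if it crosses a 32-byte boundary or its last byte is the final
   byte before one. */
bool touches_32_byte_boundary(uint64_t start, uint64_t len) {
    assert(len > 0);
    uint64_t end = start + len; /* one past the last byte */
    bool crosses = (start / 32) != ((end - 1) / 32);
    bool ends_on = (end % 32) == 0;
    return crosses || ends_on;
}
```

For example, `p2align_up(0x1003, 4)` yields `0x1010`, and a 2-byte jump at `0x101e` ends exactly on the boundary at `0x1020`, so `touches_32_byte_boundary(0x101e, 2)` returns true.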
