A collection of microbenchmarks demonstrating low-level concepts and optimisations that affect performance on modern x86 CPUs.
Each demonstration examines a single concept in isolation to make learning easier.
Warning: CPU microarchitecture ahead!
If you are new to the world of CPU architecture and microarchitecture, you may want to read Primer.md, which covers some basic concepts that are prerequisite knowledge for many of the demonstrations.
Some demonstrations inherently build on topics discussed in other demonstrations. You may want to try the demonstrations in this order to minimise confusion:
- Superscalar execution
- Out-of-order execution
- Branch prediction
- Indirect jump prediction
- Loop-carried dependencies
- Register renaming
- MOV elimination
- Zeroing idioms
- Macro-op fusion
- Cache efficiency
Enjoy!
Naturally, the exact results of microbenchmarks depend significantly on your CPU's microarchitecture - demonstrating microarchitecture in a microarchitecture-agnostic manner is difficult. Some factors that may contribute to differing results include:
- The demonstrated feature is not implemented on all CPUs.
- Particular instruction latencies are assumed.
- A minimum amount of parallel execution capacity (execution ports) is assumed.
The demonstrations were written and tested with Intel x86-64 CPUs from Skylake onwards in mind. In each demonstration, I have tried to indicate broadly which CPUs are supported and what assumptions are made.
In almost every demonstration's assembly code, you will see something like this:
.p2align 4 # Skylake JCC alignment issue (unimportant)
loop:
...
The .p2align 4 directive aligns the start of the loop to a 16-byte (2^4) boundary. This alignment ensures the loop's trailing jump instruction is placed correctly to avoid a performance pessimisation on some Intel CPUs (see Intel's paper for details).
Please ignore this issue - it does not affect the correctness of the demonstrations.