Suppose you want to write the following code inside a performance-critical loop:
if(cond) {
foo = bar;
}
where foo is a local variable (of some simple type, like int), and bar is either a local variable or a simple memory location (such as A[i]). Code like this is likely to be found, for example, in an implementation of Kadane's algorithm.
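For context, the inner loop of Kadane's maximum-subarray algorithm performs exactly this kind of update; a minimal sketch could look like this (illustration only, not the loop measured later in this post):

/* Kadane's algorithm: maximum sum over all (possibly empty) subarrays.
   Both if statements are instances of the "if(cond) foo = bar;" pattern. */
int max_subarray_sum(const int *A, int n) {
    int j;
    int cur = 0;
    int best = 0;
    for(j = 0; j < n; ++j) {
        cur += A[j];
        if(cur < 0) {
            cur = 0;       /* conditional reset of a local */
        }
        if(best < cur) {
            best = cur;    /* conditional assignment of a local */
        }
    }
    return best;
}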
There are two standard ways for an x86-64 compiler to translate such a conditional into assembly: a cmov instruction, or a conditional branch paired with a regular mov.
To make this example more concrete I will focus on this particular code:
int j;
int res = 0;
for(j = 0; j < n; ++j) {
if(A[j]) {
res = j;
}
}
GCC version 4.8.2 emits the following code for the if
at -O3 (gas' assembly notation):
movl (%rdi,%rcx,4), %r9d
testl %r9d, %r9d
cmovne %ecx, %eax
Clearly the compiler stores the location of A in %rdi, the value of j in %rcx (%ecx), and res in %rax (%eax).
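This register allocation is consistent with A arriving as the first function argument (which the System V x86-64 calling convention passes in %rdi) and res being the value returned in %eax. For reference, a self-contained function that should produce this kind of code when compiled with gcc -O3 -S could look like the following (the name last_nonzero and the exact signature are arbitrary choices):

/* Illustrative wrapper around the measured loop; name and signature are arbitrary. */
int last_nonzero(const int *A, int n) {
    int j;
    int res = 0;
    for(j = 0; j < n; ++j) {
        if(A[j]) {
            res = j;    /* the conditional assignment under study */
        }
    }
    return res;
}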
We may however ask it to forget about the cmov instruction by using the following command-line parameters in addition to -O3:
-fno-tree-loop-if-convert
-fno-tree-loop-if-convert-stores
-fno-if-conversion
-fno-if-conversion2
Then we will obtain the following assembly:
movl (%rdi,%rcx,4), %r9d
testl %r9d, %r9d
je .L4
movl %ecx, %eax
.L4:
To make this example more interesting, let's feed the compiler with more data. GCC (and probably other compilers too) implements a mechanism that allows the programmer to specify whether a condition is likely or unlikely to be true at runtime. In our case it would look like this (assuming that we expect A[j] to be false most of the time):
int j;
int res = 0;
for(j = 0; j < n; ++j) {
if(__builtin_expect(A[j], 0)) {
res = j;
}
}
The corresponding GCC output:
movl (%rdi,%rcx,4), %r9d
testl %r9d, %r9d
jne .L20
.L14:
/* Other code */
.L20:
movl %ecx, %eax
jmp .L14
We are of course free to put 1 in place of 0 whenever we expect A[j] to be true most of the time:
int j;
int res = 0;
for(j = 0; j < n; ++j) {
if(__builtin_expect(A[j], 1)) {
res = j;
}
}
And the corresponding GCC output (note the speculative movl execution!):
movl %eax, %r8d
movl (%rdi,%rcx,4), %r10d
movl %ecx, %eax
testl %r10d, %r10d
je .L29
.L24:
/* Other code */
.L29:
movl %r8d, %eax
jmp .L24
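As a side note, in real code __builtin_expect is usually hidden behind likely()/unlikely() macros, the convention used for instance in the Linux kernel. This is purely cosmetic and changes nothing in the generated code; a typical definition looks like this:

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* branch_1 could then be written as: if(likely(A[j])) res = j; */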
This way we obtain four versions of the program, which I will call cmov, branch, branch_0 and branch_1. One might ask why there is no cmov_0 or cmov_1; that is because in the cmov case the compiler simply ignored the likely/unlikely hints.
The point of this exercise is to measure their execution speed, but this of course depends on the contents of the A array. So let's fill it with random, independent, identically distributed zeros and ones, varying the probability of a one (the probability that the if's condition will be true).
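One possible way to generate such an array looks like this (the helper name fill_bernoulli and the use of rand() are just illustrative choices, not necessarily what the measurements below used):

#include <stdlib.h>

/* Fill A with i.i.d. values that are 1 with probability p and 0 otherwise.
   Illustrative helper; the actual benchmark may have used a different RNG. */
void fill_bernoulli(int *A, int n, double p) {
    int j;
    for(j = 0; j < n; ++j) {
        A[j] = rand() < p * ((double)RAND_MAX + 1.0);
    }
}

Each of the four variants can then be timed on the same array, for example with clock_gettime(CLOCK_MONOTONIC, ...) around the loop.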
Thanks to AndreasPK for providing this data. On an i7-6700K (Skylake), cmov is now faster under all circumstances:
While the performance at the extremes looks the same, cmov is still faster, which becomes obvious when looking at the raw numbers:
0.000000;0.983000;0.412000;0.550000;0.411000
...
1.000000;0.747000;0.951000;0.591000;0.398000
branch_0 comes (very) close for a probability of zero, but was still always slightly slower than cmov in the 10 tests I ran.
These are the results I obtained on my machine. The plot shows execution time, so lower is better:
Three interesting facts can be read from this plot:
- The cmov instruction is generally faster. It is also immune to the probability of the condition being true.
- However, the branch version is faster for stable, predictable conditions (probabilities close to 0 or 1).
- The branch versions take approximately p*(1-p) time to execute. The nice parabola-like shape is essentially the variance of the distribution that A was sampled from: for a Bernoulli(p) variable the variance is exactly p*(1-p), which peaks at p = 0.5, where the branch is hardest to predict.
The "visual proof" for the third fact:
To conclude, GCC emits cmov by default, and it seems like a reasonable choice, at least on my Intel Core i5. However, as noted before, it is not the optimal solution for stable, predictable conditions.