Implementation suggestion for complex multiplication to improve performance #91

Fulguritude · 2021-07-04T16:13:42Z

The current implementation for complex multiplication uses the "naive" algorithm, see lib.rs lines 683-692.

// (a + i b) * (c + i d) == (a*c - b*d) + i (a*d + b*c)
impl<T: Clone + Num> Mul<Complex<T>> for Complex<T> {
    type Output = Self;

    #[inline]
    fn mul(self, other: Self) -> Self::Output {
        let re = self.re.clone() * other.re.clone() - self.im.clone() * other.im.clone();
        let im = self.re * other.im + self.im * other.re;
        Self::Output::new(re, im)
    }
}

There exists a slightly faster alternative, a variation on the Karatsuba algorithm, for complex numbers. This algorithm works using only 3 multiplications instead of 4, but 5 additions/subtractions instead of 2. On platforms where addition and multiplication cost the same, it might not be an improvement, but I've consistently seen this algorithm show improved performance on various platforms in C (though I'm not familiar with the Rust compiler, so you might want to test things with some benchmarks).

This is the algorithm's pseudocode:

    For z =  a + ib and z' = c + id 
    k1 = c · (a + b)
    k2 = a · (d − c)
    k3 = b · (c + d)
    Re(zz') = k1 − k3
    Im(zz') = k1 + k2

source: https://en.wikipedia.org/wiki/Multiplication_algorithm#Complex_multiplication_algorithm

This would correspond, if I'm not mistaken (I'm currently learning Rust; drawing fractals, which is why I noticed this possible performance improvement; but am not sure about where to .clone() or not, so I put it everywhere), to the following code:

    #[inline]
    fn mul(self, other: Self) -> Self::Output {
        let k1 = other.re.clone() * (self.re.clone() + self.im.clone());
        let k2 = self.re.clone() * (other.im.clone() - other.re.clone());
        let k3 = self.im.clone() * (other.re.clone() + other.im.clone());
        let re = k1 - k3;
        let im = k1 + k2;
        Self::Output::new(re, im)
    }

The text was updated successfully, but these errors were encountered:

cuviper · 2021-07-08T23:00:18Z

A few thoughts:

k2 = a · (d − c) may be negative even when the result will only have positive parts.
- That's potentially an issue for Complex<T> with unsigned T, though I don't know if anyone realistically uses that.
- This could be avoided by branching: if c <= d { /* normal k2 */ } else { /* flip to c-d and then subtract k2 */ }
More generally, this changes the overflow conditions, but it's not necessarily better or worse.
- i.e. some inputs will overflow intermediate values in one algorithm but not the other, in both directions.
- That would mostly affect Gaussian integers, but floating-point Complex could also be near infinity or zero-underflow.
- Maybe this warrants waiting for a breaking-change release? Or maybe nobody would notice...
Do you have any code which can be benchmarked with this change?

not sure about where to .clone() or not, so I put it everywhere

It's reasonable to start with that everywhere, and then you can remove the clone on the last use of a given value to instead consume it. So in your code k3 doesn't need any cloning, and k2 doesn't need to clone self.re.

Fulguritude · 2021-07-09T14:38:38Z

That's potentially an issue for Complex<T> with unsigned T, though I don't know if anyone realistically uses that.

Wouldn't this also be an issue with the current multiplication ? ie, i^2 = -1 would cause a problem for any complex multiplication algorithm for any unsigned (and non-cyclic) T. But in any case, I think your solution works.

More generally, this changes the overflow conditions, but it's not necessarily better or worse.

I'm really not an expert on generic Rust (though I'd very much like to become one :D) so I don't feel confident giving you my input here.

Do you have any code which can be benchmarked with this change?

I don't have any such benchmarks in Rust, but if you tell me it's relevant, I can dig through my files and adapt some old C of mine this weekend, and provide you the source so you can run the benchmark. Or I can also try to fork this repo and make the changes, but I'm really not sure how to set that up such a fork or dependency with cargo. Otherwise, I could try to figure out how to use the system clock in Rust. The simplest thing is just to:

generate a random array of complex numbers,
start a timer,
run a defined amount of multiplications on this list with protocol 1,
end the timer: mark the operation time,
start another timer,
run the same multiplications on this list with protocol 2,
end the timer: mark the operation time,
compare the times

I can do that in C for you right now if you'd like.

It's reasonable to start with that everywhere, and then you can remove the clone on the last use of a given value to instead consume it. So in your code k3 doesn't need any cloning, and k2 doesn't need to clone self.re.

Thanks for the helpful advice !! Given that logic, I suppose I forgot to clone k1 in my line let re = k1 - k3; ?

twirrim · 2021-07-14T16:35:01Z

Not sure if this is useful, but a hopefully fairly straightforward benchmark for complex number multiplication could come from a straightforward fractal loop:

#[bench]
fn bench_mul(b: &mut Bencher) {
    // This leverages part of the julia set to benchmark multiplication
    let c = Complex {
        re: -0.7,
        im: 0.27015,
    };
    b.iter(|| {
        let mut i = 0;
        let mut z = Complex {
            re: -0.8099464512195294,
            im: 0.07329678182974636,
        };
        while z.norm() <= 2.0 && i < 255 {
            z = z * z + c;
            i += 1;
        }
    });
}

I tweaked some rust-based julia fractal code I've been fiddling with recently to spit me out some example values where it had to do lots of iteration, to get me some useful values for c and z. That seems to result in sufficient multiplications as to provide a measurement of performance. I can easily influence the performance quite drastically by deliberately breaking the mul implementation doing things like making let im = self.re * other.im + self.im * other.re; just let im = self.im;

Here's the benchmark result from this benchmark against current master:

running 1 test
test bench_mul ... bench:         270 ns/iter (+/- 17)

cuviper · 2021-07-14T23:25:18Z

With my Ryzen 7 3800X, I get 248 ns/iter on that benchmark with current master, and 264 ns/iter with the suggested changes. Most of that time is spent in norm's call to hypot though. If I change that to z.norm_sqr() <= 4.0, I get 79 ns/iter and 105 ns/iter for the old and new code, respectively.

cuviper · 2021-07-14T23:40:35Z

Here are the two implementations on godbolt, using f64: https://rust.godbolt.org/z/558nzKaso

It doesn't look surprising at this level -- indeed showing 4 mulsd versus only 3 in the new, but one more instruction total.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implementation suggestion for complex multiplication to improve performance #91

Implementation suggestion for complex multiplication to improve performance #91

Fulguritude commented Jul 4, 2021

cuviper commented Jul 8, 2021

Fulguritude commented Jul 9, 2021 •

edited

Loading

twirrim commented Jul 14, 2021

cuviper commented Jul 14, 2021

cuviper commented Jul 14, 2021

Implementation suggestion for complex multiplication to improve performance #91

Implementation suggestion for complex multiplication to improve performance #91

Comments

Fulguritude commented Jul 4, 2021

cuviper commented Jul 8, 2021

Fulguritude commented Jul 9, 2021 • edited Loading

twirrim commented Jul 14, 2021

cuviper commented Jul 14, 2021

cuviper commented Jul 14, 2021

Fulguritude commented Jul 9, 2021 •

edited

Loading