Add Rayon implementation filter_clamped_parallel for Kernel filter function (formerly filter3x3, ...) #608

Closed
wants to merge 11 commits from the feature/filter-parallel branch

Conversation

@sunsided (Contributor) commented May 18, 2024

This adds a parallelized implementation of the filter_clamped function (formerly filter3x3).

@ripytide (Member)

Normally we add parallel functions with a _parallel suffix rather than switching the behaviour of the same function via a feature. Otherwise a user would never be able to use the single-threaded and multi-threaded versions in the same application. Also, cargo features are supposed to be additive, which this approach breaks.
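Roughly, the pattern looks like this (placeholder signatures, not the crate's actual API; the real functions take an image and a kernel):

```rust
// Sketch of the convention: the sequential function is always available, and
// the parallel variant is an additional function gated behind the `rayon`
// feature, rather than a feature that swaps the behaviour of one function.
pub fn filter_clamped(input: &[u8]) -> Vec<u8> {
    // sequential implementation (placeholder body)
    input.to_vec()
}

#[cfg(feature = "rayon")]
pub fn filter_clamped_parallel(input: &[u8]) -> Vec<u8> {
    // rayon-based implementation; both functions coexist, so callers can
    // choose per call site, and enabling the feature only adds API surface
    input.to_vec()
}
```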

@sunsided (Contributor, Author) commented May 18, 2024

That sounds like a solution to my problem. Will refactor it.

@sunsided (Contributor, Author)

@ripytide Rewrote it. There's quite a bit of code duplication now with the bigger driver functions filter and gradients, but I figured it's easier to judge this way, and there might be more optimization potential there as well.

I also added the _parallel functions to the tests and benchmarks. Some output from my machine to give an idea:

test filter::benches::bench_filter3x3_i32_filter                                      ... bench:   1,685,176.00 ns/iter (+/- 8,628.49)
test filter::benches::bench_filter3x3_parallel_i32_filter                             ... bench:     429,208.09 ns/iter (+/- 44,841.39)
test seam_carving::benches::bench_shrink_width_s100_r1                                ... bench:     339,531.20 ns/iter (+/- 3,855.30)
test seam_carving::benches::bench_shrink_width_s100_r4                                ... bench:   1,332,217.10 ns/iter (+/- 3,669.44)
test seam_carving::benches::bench_shrink_width_s100_r8                                ... bench:   2,611,614.40 ns/iter (+/- 14,979.44)
test seam_carving::benches::bench_shrink_width_parallel_s100_r1                       ... bench:     123,885.59 ns/iter (+/- 16,124.79)
test seam_carving::benches::bench_shrink_width_parallel_s100_r4                       ... bench:     502,462.13 ns/iter (+/- 54,026.64)
test seam_carving::benches::bench_shrink_width_parallel_s100_r8                       ... bench:     983,152.75 ns/iter (+/- 141,549.54)

@ripytide (Member) commented May 19, 2024

Currently this parallelizes each row of the image. In our discussions and experimentation in #602 we found that it may be worth simply parallelizing per pixel rather than per row: Rayon does its own chunking of work to prevent excess dispatching, so there was little performance difference. This way it also scales more flexibly across images with different row sizes.

Also, regarding the code duplication: would it be possible to extract the kernel code from both functions and simply call it from normal for loops in the single-threaded versions, and from par_pixels() in the multi-threaded versions?
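Something like this, as a rough sketch (using rayon's par_chunks_mut in place of a par_pixels() helper; names, signatures and the kernel math are placeholders):

```rust
use rayon::prelude::*;

// Hypothetical per-pixel kernel step shared by both drivers.
fn apply_kernel_at(pixel_out: &mut [u8]) {
    for channel in pixel_out.iter_mut() {
        *channel = channel.saturating_add(1); // stand-in for the real kernel work
    }
}

// Single-threaded driver: plain loop over pixel-sized chunks.
fn filter_sequential(out_data: &mut [u8], num_channels: usize) {
    for pixel in out_data.chunks_mut(num_channels) {
        apply_kernel_at(pixel);
    }
}

// Multi-threaded driver: the same per-pixel function, driven by rayon.
fn filter_parallel(out_data: &mut [u8], num_channels: usize) {
    out_data
        .par_chunks_mut(num_channels)
        .for_each(apply_kernel_at);
}
```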

sunsided added 4 commits May 19, 2024 16:41
This refactors the previous commit by keeping the original versions of the
filter functions as-is and adding the _parallel variants only if the rayon
feature is enabled. There is some code duplication here now, which is subject to
further refactoring.
@sunsided force-pushed the feature/filter-parallel branch from 8326a53 to a27a8d9 on May 19, 2024 at 14:42
@sunsided (Contributor, Author)

Good point! I didn't even consider pixel-wise operations due to a somewhat baseless concern about hypothetical cache coherency issues, i.e. I was essentially on the same page as this thought, with the known gotcha that it degrades with special image dimensions.

Would you want me to try anyway, or leave it as it is for now and improve on it later? My thought is that it is already much faster than the non-parallelized version, and since the sequential and parallel versions are now available at the same time, caller code can make an informed decision about which one to prefer. I'm fine with trying more (and will), but I'm not exactly a Rayon expert. 😅

Regardless of the parallelization, I'll definitely have a look at refactoring the code duplication away. ✌🏼

@ripytide (Member)

As you say, having a parallel implementation at all is much more important than making it as fast as possible, since that can always be improved in the future. So I'd say just make the code as readable and de-duplicated as you can and don't worry too much about micro-benchmarking.

Review comment on src/gradients.rs (outdated, resolved)
# Conflicts:
#	src/filter/mod.rs
#	src/filter/sharpen.rs
#	src/gradients.rs
#	src/seam_carving.rs
@sunsided force-pushed the feature/filter-parallel branch 2 times, most recently from 6855ae7 to a1cadb1 on May 19, 2024 at 16:57
@sunsided force-pushed the feature/filter-parallel branch from a1cadb1 to c6a6f5f on May 19, 2024 at 17:01
@sunsided force-pushed the feature/filter-parallel branch from c6a6f5f to 5167cac on May 19, 2024 at 17:04
@sunsided requested review from theotherphil and ripytide on May 19, 2024 at 17:09
@sunsided changed the title from "Add Rayon implementation for Kernel filter function (filter3x3, ...)" to "Add Rayon implementation filter_clamped_parallel for Kernel filter function (formerly filter3x3, ...)" on May 19, 2024
@sunsided force-pushed the feature/filter-parallel branch from 5b651cb to 5e5550c on May 19, 2024 at 17:28
@ripytide (Member) left a comment

Looks good overall 👍

I'm going to take a shot at rewriting filter() so that the inner kernel bit is its own separate function that can be called from both the non-parallel and the parallel version.

Review comment on examples/gradients.rs (outdated, resolved)
@sunsided (Contributor, Author)

Circling back to the earlier comments: I did try a pixel-wise parallelization via a naive out_data.par_chunks_mut(num_channels) instead of out_data.par_chunks_mut(width as usize * num_channels), but it degraded performance compared to the row-wise version. I'll give it another thought, but it's probably for another time.
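For reference, the two chunkings compared were roughly as follows (illustrative only, not the actual driver code; out_data is assumed to be the flat output buffer):

```rust
use rayon::prelude::*;

fn demo(out_data: &mut [u8], width: u32, num_channels: usize) {
    // pixel-wise: one rayon item per pixel (the variant tried here)
    out_data
        .par_chunks_mut(num_channels)
        .for_each(|pixel| pixel.fill(0));

    // row-wise: one rayon item per row (the variant that benchmarked faster here)
    out_data
        .par_chunks_mut(width as usize * num_channels)
        .for_each(|row| row.fill(0));
}
```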

@ripytide (Member) commented May 19, 2024

Okay, I've finished the from-scratch rewrite in this commit, using an inner per-pixel filter function. According to my testing the new version is 1.02x faster on sequential and 1.2x faster on parallel, which is within the margin of error. It's also de-duplicated as discussed.

Do you want to pull that commit into this PR, or should I open a separate PR after this one is merged?

@sunsided (Contributor, Author)

@ripytide Cool! I merged it in here.

@theotherphil (Contributor)

Thanks for this. The new functionality looks good. However, I'm not keen on reducing the flexibility of the existing filter function for the sake of saving a bit of code duplication with the new function. And I'm not sure what impact this refactoring has on the runtime performance of the non-parallel filter function. Please could you check the benchmarks before and after this change.

@ripytide (Member)

> However, I'm not keen on reducing the flexibility of the existing filter

What do you mean by reducing the flexibility of filter()? It has an identical function signature.

Well, except for f changing from FnMut(&mut Q::Subpixel, K) to Fn(K) -> Q::Subpixel, which I would argue is a clearer function signature for what is essentially a mapping operation. And technically the bounds also changed from Q::Subpixel: Into<K> to K: From<Q::Subpixel>, but I'm not sure if there's a difference between them.
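To make the difference concrete, here is a simplified sketch with placeholder types (i32 accumulator, u8 subpixel) instead of the crate's actual generics:

```rust
// Before: FnMut(&mut Q::Subpixel, K) — the closure writes the result through
// a mutable reference to the output subpixel.
fn write_through(acc: i32, out: &mut u8, mut f: impl FnMut(&mut u8, i32)) {
    f(out, acc);
}

// After: Fn(K) -> Q::Subpixel — the closure is a pure mapping from the
// accumulated kernel value to the output subpixel.
fn map_value(acc: i32, out: &mut u8, f: impl Fn(i32) -> u8) {
    *out = f(acc);
}
```

On the bounds: the standard library's blanket impl<T, U: From<T>> Into<U> for T means that K: From<Q::Subpixel> implies Q::Subpixel: Into<K>, so the two constraints should accept the same callers except for types that implement Into without From.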

@ripytide (Member) commented May 21, 2024

According to my benchmarking, the non-parallel version is within the noise margin of the previous implementation's speed. Here is the raw benchmark comparison output:

 name                                                                          before  ns/iter  after  ns/iter  diff ns/iter   diff %  speedup 
 filter::benches::bench_box_filter                                             2,068,955        2,068,628               -327   -0.02%   x 1.00 
 filter::benches::bench_filter_clamped_i32_filter                              3,304,223        3,224,476            -79,747   -2.41%   x 1.02 
 filter::benches::bench_gaussian_f32_stdev_1                                   286,579          286,494                  -85   -0.03%   x 1.00 
 filter::benches::bench_gaussian_f32_stdev_10                                  1,384,830        1,386,037              1,207    0.09%   x 1.00 
 filter::benches::bench_gaussian_f32_stdev_3                                   527,508          526,958                 -550   -0.10%   x 1.00 
 filter::benches::bench_horizontal_filter                                      906,407          904,490               -1,917   -0.21%   x 1.00 
 filter::benches::bench_separable_filter                                       667,202          667,295                   93    0.01%   x 1.00 
 filter::benches::bench_vertical_filter                                        940,168          940,354                  186    0.02%   x 1.00 

@theotherphil (Contributor)

> However, I'm not keen on reducing the flexibility of the existing filter
>
> What do you mean by reducing the flexibility of filter()? It has an identical function signature.
>
> Well, except for f changing from FnMut(&mut Q::Subpixel, K) to Fn(K) -> Q::Subpixel, which I would argue is a clearer function signature for what is essentially a mapping operation. And technically the bounds also changed from Q::Subpixel: Into<K> to K: From<Q::Subpixel>, but I'm not sure if there's a difference between them.

Yes, I meant this change to the function signature. I don’t feel too strongly - it's probably a pretty niche use case and if we do break anyone’s code I guess they’ll let us know. I’ll merge this after the merge conflict is resolved.

@ripytide mentioned this pull request on May 25, 2024
@ripytide (Member)

I've rebased this in #642

@sunsided (Contributor, Author)

> I've rebased this in #642

Thank you! Should we close this PR here then?

@ripytide (Member)

Yeah, can do.

@sunsided (Contributor, Author)

Closing in favor of #642
