Add Rayon implementation filter_clamped_parallel for Kernel filter function (formerly filter3x3, ...) #608

Closed
wants to merge 11 commits from the feature/filter-parallel branch

Conversation

@sunsided (Contributor) commented May 18, 2024

This adds a parallelized implementation of the filter_clamped function (formerly filter3x3).

@ripytide (Member)

Normally we add parallel functions with a _parallel suffix rather than switching the behaviour of the same function via a feature. Otherwise a user would never be able to use the single-threaded and multi-threaded versions in the same application. Also, cargo features are supposed to be additive, which this approach breaks.
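Roughly, the pattern looks like this (placeholder signatures, not the crate's actual API; the real functions take an image and a kernel):

```rust
// Sketch of the convention: the sequential function is always available, and
// the parallel variant is an additional function gated behind the `rayon`
// feature, rather than a feature that swaps the behaviour of one function.
pub fn filter_clamped(input: &[u8]) -> Vec<u8> {
    // sequential implementation (placeholder body)
    input.to_vec()
}

#[cfg(feature = "rayon")]
pub fn filter_clamped_parallel(input: &[u8]) -> Vec<u8> {
    // rayon-based implementation; both functions coexist, so callers can
    // choose per call site, and enabling the feature only adds API surface
    input.to_vec()
}
```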

@sunsided (Contributor, Author) commented May 18, 2024

That sounds like a solution to my problem. Will refactor it.

@sunsided (Contributor, Author)

@ripytide Rewrote it. There's quite a bit of code duplication now with the bigger driver functions filter and gradients, but I figured it's easier to judge this way, and there might be more optimization potential there as well.

I also added the _parallel functions to the tests and benchmarks. Some output from my machine to give an idea:

test filter::benches::bench_filter3x3_i32_filter                                      ... bench:   1,685,176.00 ns/iter (+/- 8,628.49)
test filter::benches::bench_filter3x3_parallel_i32_filter                             ... bench:     429,208.09 ns/iter (+/- 44,841.39)
test seam_carving::benches::bench_shrink_width_s100_r1                                ... bench:     339,531.20 ns/iter (+/- 3,855.30)
test seam_carving::benches::bench_shrink_width_s100_r4                                ... bench:   1,332,217.10 ns/iter (+/- 3,669.44)
test seam_carving::benches::bench_shrink_width_s100_r8                                ... bench:   2,611,614.40 ns/iter (+/- 14,979.44)
test seam_carving::benches::bench_shrink_width_parallel_s100_r1                       ... bench:     123,885.59 ns/iter (+/- 16,124.79)
test seam_carving::benches::bench_shrink_width_parallel_s100_r4                       ... bench:     502,462.13 ns/iter (+/- 54,026.64)
test seam_carving::benches::bench_shrink_width_parallel_s100_r8                       ... bench:     983,152.75 ns/iter (+/- 141,549.54)

@ripytide (Member) commented May 19, 2024

Currently this parallelizes each row of the image. In our discussions and experimentation in #602 we found that it may be worth simply parallelizing per pixel rather than per row: Rayon does its own chunking of work to prevent excess dispatching, so there was little performance difference. This way it also scales more flexibly across images with different row sizes.

Also, regarding the code duplication: would it be possible to extract the kernel code from both functions and simply call it from normal for loops in the single-threaded versions, and from par_pixels() in the multi-threaded versions?
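Something like this, as a rough sketch (using rayon's par_chunks_mut in place of a par_pixels() helper; names, signatures and the kernel math are placeholders):

```rust
use rayon::prelude::*;

// Hypothetical per-pixel kernel step shared by both drivers.
fn apply_kernel_at(pixel_out: &mut [u8]) {
    for channel in pixel_out.iter_mut() {
        *channel = channel.saturating_add(1); // stand-in for the real kernel work
    }
}

// Single-threaded driver: plain loop over pixel-sized chunks.
fn filter_sequential(out_data: &mut [u8], num_channels: usize) {
    for pixel in out_data.chunks_mut(num_channels) {
        apply_kernel_at(pixel);
    }
}

// Multi-threaded driver: the same per-pixel function, driven by rayon.
fn filter_parallel(out_data: &mut [u8], num_channels: usize) {
    out_data
        .par_chunks_mut(num_channels)
        .for_each(apply_kernel_at);
}
```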

sunsided added 4 commits May 19, 2024 16:41
This refactors the previous commit by keeping the original versions of the
filter functions as-is and adding the _parallel variants only if the rayon
feature is enabled. There is some code duplication here now, which is subject to
further refactoring.
@sunsided force-pushed the feature/filter-parallel branch from 8326a53 to a27a8d9 on May 19, 2024 at 14:42
@sunsided (Contributor, Author)

Good point! I didn't even consider pixel-wise operations due to a somewhat baseless concern about hypothetical cache coherency issues, i.e. I was essentially on the same page as this thought, with the known gotcha that it degrades with special image dimensions.

Would you want me to try anyway, or leave it as it is for now and improve on it later? My thought is that it is already much faster than the non-parallelized version, and since the sequential and parallel versions are now available at the same time, caller code can make an informed decision about which one to prefer. I'm fine with trying more (and will), but I'm not exactly a Rayon expert. 😅

Regardless of the parallelization, I'll definitely have a look at refactoring the code duplication away. ✌🏼

@ripytide (Member)

As you say, having a parallel implementation at all is much more important than making it as fast as possible, since that can always be improved in the future. So I'd say just make the code as readable and de-duplicated as you can and don't worry too much about micro-benchmarking.

Review comment on src/gradients.rs (outdated, resolved)
# Conflicts:
#	src/filter/mod.rs
#	src/filter/sharpen.rs
#	src/gradients.rs
#	src/seam_carving.rs
@sunsided force-pushed the feature/filter-parallel branch 2 times, most recently from 6855ae7 to a1cadb1 on May 19, 2024 at 16:57
@sunsided force-pushed the feature/filter-parallel branch from a1cadb1 to c6a6f5f on May 19, 2024 at 17:01
@sunsided force-pushed the feature/filter-parallel branch from c6a6f5f to 5167cac on May 19, 2024 at 17:04
@sunsided requested review from theotherphil and ripytide on May 19, 2024 at 17:09
@sunsided changed the title from "Add Rayon implementation for Kernel filter function (filter3x3, ...)" to "Add Rayon implementation filter_clamped_parallel for Kernel filter function (formerly filter3x3, ...)" on May 19, 2024
@sunsided force-pushed the feature/filter-parallel branch from 5b651cb to 5e5550c on May 19, 2024 at 17:28
@ripytide (Member) left a comment

Looks good overall 👍

I'm going to take a shot at rewriting filter() so that the inner kernel bit is its own separate function that can be called from both the non-parallel and the parallel version.

Review comment on examples/gradients.rs (outdated, resolved)
@sunsided (Contributor, Author)

Circling back to the earlier comments: I did try a pixel-wise parallelization via a naive out_data.par_chunks_mut(num_channels) instead of out_data.par_chunks_mut(width as usize * num_channels), but it degraded performance compared to the row-wise version. I'll give it another thought, but it's probably for another time.
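For reference, the two chunkings compared were roughly as follows (illustrative only, not the actual driver code; out_data is assumed to be the flat output buffer):

```rust
use rayon::prelude::*;

fn demo(out_data: &mut [u8], width: u32, num_channels: usize) {
    // pixel-wise: one rayon item per pixel (the variant tried here)
    out_data
        .par_chunks_mut(num_channels)
        .for_each(|pixel| pixel.fill(0));

    // row-wise: one rayon item per row (the variant that benchmarked faster here)
    out_data
        .par_chunks_mut(width as usize * num_channels)
        .for_each(|row| row.fill(0));
}
```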

@ripytide (Member) commented May 19, 2024

Okay, I've finished the from-scratch rewrite in this commit, using an inner per-pixel filter function. According to my testing the new version is 1.02x faster on sequential and 1.2x faster on parallel, which is within the margin of error. It's also de-duplicated as discussed.

Do you want to pull that commit into this PR, or should I open a separate PR after this one is merged?

@sunsided (Contributor, Author)

@ripytide Cool! I merged it in here.

@theotherphil (Contributor)

Thanks for this. The new functionality looks good. However, I'm not keen on reducing the flexibility of the existing filter function for the sake of saving a bit of code duplication with the new function. And I'm not sure what impact this refactoring has on the runtime performance of the non-parallel filter function. Please could you check the benchmarks before and after this change.

@ripytide (Member)

> However, I'm not keen on reducing the flexibility of the existing filter

What do you mean by reducing the flexibility of filter()? It has an identical function signature.

Well, except for f changing from FnMut(&mut Q::Subpixel, K) to Fn(K) -> Q::Subpixel, which I would argue is a clearer function signature for what is essentially a mapping operation. And technically the bounds also changed from Q::Subpixel: Into<K> to K: From<Q::Subpixel>, but I'm not sure if there's a difference between them.
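To make the difference concrete, here is a simplified sketch with placeholder types (i32 accumulator, u8 subpixel) instead of the crate's actual generics:

```rust
// Before: FnMut(&mut Q::Subpixel, K) — the closure writes the result through
// a mutable reference to the output subpixel.
fn write_through(acc: i32, out: &mut u8, mut f: impl FnMut(&mut u8, i32)) {
    f(out, acc);
}

// After: Fn(K) -> Q::Subpixel — the closure is a pure mapping from the
// accumulated kernel value to the output subpixel.
fn map_value(acc: i32, out: &mut u8, f: impl Fn(i32) -> u8) {
    *out = f(acc);
}
```

On the bounds: the standard library's blanket impl<T, U: From<T>> Into<U> for T means that K: From<Q::Subpixel> implies Q::Subpixel: Into<K>, so the two constraints should accept the same callers except for types that implement Into without From.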

@ripytide (Member) commented May 21, 2024

According to my benchmarking, the non-parallel version is within the noise margin of the previous implementation's speed. Here is the raw benchmark comparison output:

 name                                                                          before  ns/iter  after  ns/iter  diff ns/iter   diff %  speedup 
 filter::benches::bench_box_filter                                             2,068,955        2,068,628               -327   -0.02%   x 1.00 
 filter::benches::bench_filter_clamped_i32_filter                              3,304,223        3,224,476            -79,747   -2.41%   x 1.02 
 filter::benches::bench_gaussian_f32_stdev_1                                   286,579          286,494                  -85   -0.03%   x 1.00 
 filter::benches::bench_gaussian_f32_stdev_10                                  1,384,830        1,386,037              1,207    0.09%   x 1.00 
 filter::benches::bench_gaussian_f32_stdev_3                                   527,508          526,958                 -550   -0.10%   x 1.00 
 filter::benches::bench_horizontal_filter                                      906,407          904,490               -1,917   -0.21%   x 1.00 
 filter::benches::bench_separable_filter                                       667,202          667,295                   93    0.01%   x 1.00 
 filter::benches::bench_vertical_filter                                        940,168          940,354                  186    0.02%   x 1.00 

@theotherphil (Contributor)

> However, I'm not keen on reducing the flexibility of the existing filter
>
> What do you mean by reducing the flexibility of filter()? It has an identical function signature.
>
> Well, except for f changing from FnMut(&mut Q::Subpixel, K) to Fn(K) -> Q::Subpixel, which I would argue is a clearer function signature for what is essentially a mapping operation. And technically the bounds also changed from Q::Subpixel: Into<K> to K: From<Q::Subpixel>, but I'm not sure if there's a difference between them.

Yes, I meant this change to the function signature. I don’t feel too strongly - it's probably a pretty niche use case and if we do break anyone’s code I guess they’ll let us know. I’ll merge this after the merge conflict is resolved.

@ripytide mentioned this pull request on May 25, 2024
@ripytide (Member)

I've rebased this in #642

@sunsided (Contributor, Author)

> I've rebased this in #642

Thank you! Should we close this PR here then?

@ripytide (Member)

Yeah, can do.

@sunsided (Contributor, Author)

Closing in favor of #642
