-
Notifications
You must be signed in to change notification settings - Fork 151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parallel Pixel Maps #602
Parallel Pixel Maps #602
Conversation
I've just realized this was attempted in #446 but was closed due to being slower than the original. After reading the PR it looks like it used |
Here are the results:
As can be seen which function is faster depends on the size of the image, 100x100 images are faster with the non-parallel version whereas 1000x1000 images are faster with the parallel version. |
my PC is slower than yours (probably all the other stuff i'm running), but i might have something to help for the inefficiencies for low values. lots of time is lost on dispatching, so we need to dispatch less, by parallelizing on lines instead of pixels. here are my results :
my impl on my pc
it is still much slower for very small images, so one could just default to sequential for images under 10k pixels
|
here is my take on the function body {
use itertools::Itertools;
use rayon::iter::IndexedParallelIterator;
use rayon::iter::ParallelIterator;
use rayon::slice::{ParallelSlice, ParallelSliceMut};
let (width, height) = image.dimensions();
let mut out: ImageBuffer<ChannelMap<P, S>, Vec<S>> = ImageBuffer::new(width, height);
image
.par_chunks(width as usize * <P as Pixel>::CHANNEL_COUNT as usize)
.zip_eq(
out.par_chunks_mut(
width as usize * <ChannelMap<P, S> as Pixel>::CHANNEL_COUNT as usize,
),
)
.for_each(|(input_line, output_line)| {
input_line
.chunks_exact(<P as Pixel>::CHANNEL_COUNT as usize)
.map(|v| <P as Pixel>::from_slice(v))
.zip_eq(
output_line
.chunks_exact_mut(<ChannelMap<P, S> as Pixel>::CHANNEL_COUNT as usize)
.map(|v| <ChannelMap<P, S> as Pixel>::from_slice_mut(v)),
)
.for_each(|(input_pixel, output_channels)| {
input_pixel
.channels()
.iter()
.zip(output_channels.channels_mut())
.for_each(|(input_subpixel, output_subpixel)| {
*output_subpixel = f(*input_subpixel)
})
});
});
out
} this is more efficient the faster the function is, which is great when it's just a sum or multiplication |
Great work, line based parallelism does seem like the best option for CPU cache since all the memory would be right next to each other. I wonder how these scale at 10000x10000. |
i have a better iteration, that is both faster for small values and more readable {
use rayon::iter::IndexedParallelIterator;
use rayon::iter::ParallelIterator;
use rayon::slice::{ParallelSlice, ParallelSliceMut};
let (width, height) = image.dimensions();
let mut out: ImageBuffer<ChannelMap<P, S>, Vec<S>> = ImageBuffer::new(width, height);
const SUBPIXEL_PER_CHUNK : usize = 512;
image
.par_chunks(SUBPIXEL_PER_CHUNK)
.zip_eq(out.par_chunks_mut(SUBPIXEL_PER_CHUNK))
.for_each(|(input_line, output_line)| {
input_line.iter().zip(output_line.iter_mut()).for_each(
|(input_subpixel, output_subpixel)| *output_subpixel = f(*input_subpixel),
);
});
out
} |
|
I've added you to my fork in case you want to update the implementation as we find faster versions. |
sadly for me it doesn't seem to improve much beyond 3x. might be my pc's fault
|
the values i get from my tests are really weird, i think there might be something weird going on. it'd be nice if you could test it on your machine. |
it says it's 10x faster for a 30 by 30 but only about 3x faster for a 1000 by 1000
which is extra weird because for 30 by 30 there should only be one thread running so there shouldn't be any significant improvement |
small question : why does |
|
well, i have bad news |
The benches on my machine (6-core Ryzen 5 4500U):
|
what happens if you replace the body by {
use itertools::Itertools;
let (width, height) = image.dimensions();
let mut out: ImageBuffer<ChannelMap<P, S>, Vec<S>> = ImageBuffer::new(width, height);
image.iter().zip_eq(out.iter_mut()).for_each(
|(input_subpixel, output_subpixel)| *output_subpixel = f(*input_subpixel),
);
out
}
? |
|
well it seems the parallel version is slightly faster for big images |
Well at least we've learnt that there is massive room for optimization via |
still a bit perplexed at how inefficient the parallelization is, there might be something else to it i'm missing |
From the cargo docs:
I think |
here is what happen if i remove the
the good value of |
i have also come to realize that chunking manually is actually pointless, because that's already what rayon does |
T_T i thought i had done clippy on my machine |
I've de-duplicated the benchmark mapping function into the macro and switched |
The benchmarks take forever (200s) on my machine mainly due to the 10_000 functions (especially the sequential one), we'll probably have to remove those for the final PR. |
makes sense |
Thanks for the PR. I’ll wait until after the next release to merge this. Which will probably be next weekend as I don’t have much time on weekdays. |
src/map.rs
Outdated
S: std::marker::Send, | ||
S: std::marker::Sync, | ||
<P as WithChannel<S>>::Pixel: std::marker::Send, | ||
<P as WithChannel<S>>::Pixel: std::marker::Sync, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This where
clause is horrific! Is there any way of shortening it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could add use std::marker::{Send, Sync}
at the top of the file and then
S: std::marker::Send,
S: std::marker::Sync,
would become
S: Send + Sync,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the final result would be something like that :
pub fn map_subpixels_parallel<I, P, F, S>(image: &I, f: F) -> Image<ChannelMap<P, S>>
where
I: GenericImage<Pixel = P> + Deref<Target = [P::Subpixel]>,
P: WithChannel<S> + Send + Sync,
S: Primitive + Send + Sync,
F: Fn(P::Subpixel) -> S + Sync,
P::Subpixel: Send + Sync,
<P as WithChannel<S>>::Pixel: Send + Sync,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Send
and Sync
are already in rust prelude
…e parallel vs sequential version
Okay now that all the blockers have been resolved I've fleshed out the rest of this PR.
|
@theotherphil this is now ready for review. |
The new code looks good and the signature change makes sense. I’m not convinced by the name changes though. You’re right that |
That's true, but it is the lesser evil in my opinion as users can see their signatures fairly easily from the docs. Unless you have an alternative suggestion? |
One alternative: |
I’ll merge as is and think about what best to name these before the next release. Thanks for the new functions, and it’s good to see your new doc macros already paying off. |
This is just a draft so I can get some feedback on an initial implementation of
map_subpixels_parallel()
before I add the remaining variation of functionsmap_subpixels_parallel_mut()
map_pixels_parallel()
map_pixels_parallel_mut()
.This would address #429