-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add parallel reduction supports for RowIterator and NamedTupleIterator #187
base: master
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## master #187 +/- ##
==========================================
+ Coverage 96.71% 96.80% +0.09%
==========================================
Files 6 6
Lines 456 469 +13
==========================================
+ Hits 441 454 +13
Misses 15 15
Continue to review full report at Codecov.
|
Can you share a bit more on the motivation here? Obviously, we want to be cautious/wary of taking on new dependencies when it will also affect so many downstream packages. Questions that pop up in my mind:
|
Hi, thanks for the response and sorry for this late reply.
Yes, I understand this and I should've clarified what SplittablesBase.jl is. Essentially I am hoping I probably should open an RFC in JuliaLang/julia but I've been a bit hesitant to do so since I don't feel like this interface is tested outside my packages. I thought of this PR as a step toward accumulating such experience.
The example in the OP with FLoops.jl is one thing. We'd be able to use ThreadsX with this. I think ThreadsX.jl + OnlineStats.jl integration is appealing to the Tables.jl users. Underneath, they all boils down to
If it is already an array, the generic fallback in SplittablesBase.jl covers it already. So, I don't need to add a specific implementation for
My intention is making it very minimal although I have to put the implementation for I think it's almost 1.0-ready but there is one specification of an optional API If you want to postpone merging this at least until SplittablesBase.jl hits 1.0, I think that's a very reasonable decision. I can extract out this PR to a separate package SplittableTables.jl for this to work (by touching the internals of Tables.jl a bit). But it'd be nice if we can tweak |
This is a really useful PR, but I see @quinnj's point about introducing an extra dependency on all of the downstream dependents of Tables. Maybe it would make more sense to have a separate package |
I'm not sure this is true, given
I think SplittablesBase.jl is fine, given it's a very small dependency. cc @quinnj and @MasonProtter -- being able to use JuliaFolds with Tables and DataFrames would be awesome. |
This PR implements SplittablesBase.jl interface
halve
onRowIterator
andNamedTupleIterator
. This let us use parallel reductions built on top of SplittablesBase.jl such as Transducers.jl, ThreadsX.jl, and FLoops.jl:A tricky part of this PR is that, since
SplittablesTesting.test_ordered
usesisequal
to compare items (rows), I needed to relaxisequal
to ignore the storage type of columns. The difference is thatis
false
before this PR andtrue
after this PR. I think it makes sense thatColumnsRow
to be compared as if they are lowered toNamedTuple
s. This is also compatible with that the equalities on arrays ignore the typeWhat do you think?