Port to StructArrays #203
I'm curious --- what are the performance problems with
The fact that this is a lot of work means we need to take a hard look at our interfaces and clean them up. We need to make sure whichever implementation we use (Columns or StructArrays) strictly adheres to the array API, plus whatever minimal extensions are needed to take advantage of the column representation.
I have a few things I'd like to mention, so I'll try to give some structure to this post.
Interface
When I said a lot of work I may have exaggerated, but certainly there is not a clean interface such that by overloading 3 or 4 functions for
Sortperm
StructArrays does not depend on PooledArrays and in my mind it shouldn't as the two concepts are orthogonal. The main difference in behavior due to this is that calling
Collection mechanism
This actually simplifies the collection code and makes it more general. In particular
Collect benchmarks
I think the performance difference is not strictly with
using BenchmarkTools
using StructArrays, IndexedTables
N = 10000;
v1 = StructArray(a=rand(1:10, N), b=rand(1:10, N), c=rand(1:10, N)) ;
v2 = Columns(a=rand(1:10, N), b=rand(1:10, N), c=rand(1:10, N)) ;
f1(v) = StructArrays.collect_structarray(v[i] for i in eachindex(v))
f2(v) = collect_columns(v[i] for i in eachindex(v))
f3(v) = StructArrays.collect_structarray(StructArrays.LazyRow(v, i) for i in eachindex(v))
@btime f1(v1);
@btime f2(v1);
@btime f3(v1);
@btime f1(v2);
@btime f2(v2);
That gives:
julia> @btime f1(v1);
30.013 μs (14 allocations: 234.80 KiB)
julia> @btime f2(v1);
39.197 μs (33 allocations: 235.59 KiB)
julia> @btime f3(v1);
31.663 μs (14 allocations: 234.80 KiB)
julia> @btime f1(v2);
31.332 μs (14 allocations: 234.80 KiB)
julia> @btime f2(v2);
40.534 μs (33 allocations: 235.59 KiB)
From which (if I'm not doing anything wrong) I would say that the container type (
A possible way forward would be to first port things to StructArrays (to avoid code duplication and to have a more general, more widely used array type on which to build). This would then allow using both eager iteration or
Just to expand a bit on the choice of a good set of constructors, I had a write-up from slack that I'm copypasting here:
And it is probably the one thing about the StructArrays API where I'm still unsure and would like feedback. @shashi seemed to find option 2 less confusing and personally I'm OK with either option.
Wow, thank you for the detailed write up. In general I'm fine with porting to StructArrays, but I have some comments.
I don't really like this --- it's basically an implicit map/broadcast of the kind we don't do anymore. A correct expression would be something like
I think that's fine; it's slightly awkward that names have to be invented in the case of tuples (since tuples are the only "struct" that don't have field names) but it's not the end of the world.
+1 to supporting more dimensions. I'm not sure
It seems to me we should try to eliminate this. Specially supporting both NamedTuple and Tuple makes at least a bit of sense, since they represent objects with symbol and integer keys, respectively. But I don't know why Pair would be special. Probably best to just use NamedTuple always.
👍 totally agree. We should follow other array type constructors; anything else will just cause more pain in the future. So
👍
As long as the default is just to unwrap one level I'm fine with it --- we don't (yet) need anything else in IndexedTables/JuliaDB.
I'm indifferent to this for now, but I don't think this could be called the "right" abstraction for this. It seems like in general you want to specify a layout of which fields to unwrap, which has nothing to do with the types of those fields.
👍 Sounds good. I think it's quite likely LazyRow should be the default.
I guess my matlab training is showing here. The autovectorization
It may be easier to first get tests passing by cheating by having
At the moment those two things are
I'm not 100% sure what the right abstraction could be. On the plus side, the code is organized in such a way that the actual code to fill in the columns and the initializer are fully decoupled, so it should be straightforward to change the API to define the unwrapping details (unwrapping is only relevant to initialize things, not for the actual collection).
Totally agree. It would still be nice to have a way to get a container of the columns though --- one issue with current approach is that for indexing and iteration a StructArray acts like a container of structs, but for properties it acts like a container of arrays. We should have a function to explicitly obtain a container of columns.
Yes, that's an improvement for now. It's quite possible that eventually
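As an illustration of the "container of structs for indexing, container of arrays for properties" point, here is a minimal self-contained sketch. `ToyStructVector` and `columns_of` are hypothetical names invented for this example; this is not the StructArrays.jl implementation, only a toy under the assumption that columns are stored as a NamedTuple of vectors.

```julia
# Hypothetical toy type for illustration only; NOT the StructArrays.jl code.
# Rows are NamedTuples of type T, storage is a NamedTuple of column vectors.
struct ToyStructVector{T<:NamedTuple, C<:NamedTuple} <: AbstractVector{T}
    cols::C
end

# Derive the row type from the column names and element types.
function ToyStructVector(cols::NamedTuple)
    T = NamedTuple{keys(cols), Tuple{map(eltype, values(cols))...}}
    return ToyStructVector{T, typeof(cols)}(cols)
end

# Explicit accessor for the container of columns (hypothetical name).
columns_of(v::ToyStructVector) = getfield(v, :cols)

# Array interface: behaves like a vector of rows.
Base.size(v::ToyStructVector) = size(first(columns_of(v)))
Base.IndexStyle(::Type{<:ToyStructVector}) = IndexLinear()
Base.getindex(v::ToyStructVector{T}, i::Int) where {T} =
    T(map(c -> c[i], values(columns_of(v))))

# Property access: behaves like a container of columns.
Base.getproperty(v::ToyStructVector, name::Symbol) =
    getproperty(columns_of(v), name)

v = ToyStructVector((a = [1, 2, 3], b = [0.5, 1.5, 2.5]))
row = v[2]            # a row: (a = 2, b = 1.5)
col = v.a             # a column: [1, 2, 3]
cols = columns_of(v)  # the explicit container of columns
```

With an explicit accessor like `columns_of`, code that needs the columns does not have to rely on property access, which keeps the array interface and the column interface separate.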
Sounds reasonable, but then I think
  end
  elseif copy
- cs = Base.copy(cs)
+ cs = copyto!(similar(cs), cs)
Curious why this change is necessary? Shouldn't `copy` work?
Using `copy` would create a subtle bug. This code expects `similar` and `copy` to return things of the same type (for `index` and `index_buffer`, for example). However, this is false in general for arrays: `copy(1:3) isa UnitRange` whereas `similar(1:3) isa Array`. This used to work with `Columns` because `Columns` has no custom method for `copy`, so it would default to `copymutable(x) = copyto!(similar(x), x)`, which would in turn make all columns mutable. `copy` for a `StructArray` is instead defined as a column-wise copy (which I think is correct).
I think `copymutable` is the correct choice here in general, as we want to be able to modify the indexing arrays as well as the buffer arrays, but I don't think copying a `StructArray` should in general make the field arrays mutable.
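For readers following along, the difference can be seen with Base alone. This is a small sketch that uses a plain range as a stand-in for an immutable index column (an assumption for illustration; the real `index` and `index_buffer` fields are not reproduced here).

```julia
# Base-only illustration; `r` stands in for an immutable index column.
r = 1:3

c = copy(r)                  # still a UnitRange, cannot be modified in place
s = similar(r)               # uninitialized but mutable Vector{Int}
m = copyto!(similar(r), r)   # mutable Vector{Int} holding the same elements

m[1] = 10                    # works because `m` is a plain Vector
# `c[1] = 10` would throw instead, since a range is immutable.
```

So `copy` preserves the (possibly immutable) container type, while `copyto!(similar(x), x)` always yields something writable, which is what the indexing and buffer arrays need.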
👍 Thanks for the explanation. The
I've opened an issue to discuss the question in general at JuliaData/Tables.jl#52. Returning a table is also possible, but I would like to somehow have the concept of an iterator where some of it is sorted. For example I'd like to do
I would love to have feedback on this issue, but I'd recommend continuing the discussion at JuliaData/Tables.jl#52 as I think it's relevant in general for the tabular data ecosystem.
This is extremely WIP, and it actually looks like it may be a massive amount of work. The idea is to replace `Columns` with `StructVector` from the StructArrays package. `StructVector` is strictly more general and a bit more performant.