Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port to StructArrays #203

Merged
merged 21 commits into from
Dec 20, 2018
Merged

Port to StructArrays #203

merged 21 commits into from
Dec 20, 2018

Conversation

piever
Copy link
Collaborator

@piever piever commented Dec 16, 2018

This is extremely wip and it actually looks like it may be a massive work. The idea is to replace Columns with StructVector from the StructArray package. StructVector is strictly more general and a bit more performant.

@JeffBezanson
Copy link
Contributor

I'm curious --- what are the performance problems with Columns?

The fact that this is a lot of work means we need to take a hard look at our interfaces and clean them up. We need to make sure whichever implementation we use (Columns or StructArrays) strictly adheres to the array API, plus whatever minimal extensions are needed to take advantage of the column representation. Columns is definitely not perfect on this score. But we should take advantage of this opportunity. The special-case methods for Tuple, NamedTuple, and Pair worry me. Are they really necessary?

@piever
Copy link
Collaborator Author

piever commented Dec 17, 2018

I have a few things I'd like to mention, so I'll try to give some structure to this post.

Interface

When I said a lot of work I may have exaggerated, but certainly there is not a clean interface such that by overloading 3 or 4 functions for StructArray everything just works. The main "pain points" / differences between Columns and StructVector are:

  • StructArray overloads getproperty to be able to use s.re to get the array of real components if s is a StructArray{ComplexF64} for example. Instead in IndexedTables the code often use c.columns to access the named tuple of arrays. This should definitely be cleaned up.
  • getfield(c, :columns) in IndexedTables is a bit different from getfield(s, :fieldarrays). In IndexedTables, the way columns are stored depends on the eltype, so that Columns{<:Tuple} stores as a Tuple, Columns{<:NamedTuple} stores as a NamedTuple and Columns{Pair} stores as a Pair. Given that StructArrays are supposed to work with any struct (maybe user defined) this doesn't quite work in general (what would be the columnar storage for StructArray{ComplexF64}?), so the various arrays are always stored as a NamedTuple
  • There is some nomenclature difference: given that StructArray can be higher dimensionsal, the arrays corresponding to the various fields of the eltype are called fieldarrays rather than columns. Otherwise for a matrix of complex number, say StructArray{ComplexF64}(undef, (4, 5)) it'd be very confusing that the columns are not matrix columns but real and imaginary components instead. At the momen I have overloaded columns in IndexedTables to return the fieldarrays with the caviat of the point above that they should be converted to Tuple or Pair if required (for backward compatibility): I'm not sure about this.
  • StructArray has less constructors that Columns. I don't really like the StructArray(args...) = StructArray(args) constructor because it gets confusing when there is only one column. Somehow, I would expect Columns(c::Columns) = c rather than Columns(c::Columns) = Columns((c,)). I also don't have StructArray(args...; names = ...) for the same reason but it may be OK with a mandatory keyword argument (I would like to check how well it infers though, because I'm trying to have all constructors be inferrable, maybe it's OK if one passes names as a tuple)

Sortperm

StructArrays does not depend on PooledArrays and in my mind it shouldn't as the two concepts are orthogonal. The main difference in behavior due to this is that calling sortperm on a StructArray doesn't construct a PooledArray with the first column. I think it'd be a bit strange to have sortperm(StructArray(a = ["a", "b", "c"])) create a PooledArray when sortperm(["a", "b", "c"]) doesn't (correctly so I believe as it'd be counter productive if all strings are different). I think this is the correct approach, but we should either encourage users to work with PooledArray already (if they have repeated entries) or use some heuristic to decide when/whether to convert under the hood in operations that need to group data.

Collection mechanism

StructArrays tries to work with any struct, but I think when collecting an iterable of structs into a StructArray there is a fundamental ambiguity as to whether to "unwrap" or not. If I'm collecting an iterable of objects of type Tuple{ComplexF64, ComplexF64} do I want two Array{ComplexF64} or should we further unwrap into StructArray{ComplexF64} (so in practice we'd have four arrays Array{Float64}. This at the moment is defined by the ArrayInitializer one can pass to collect_structarray. For example one can do collect_structarray(itr; initializer = ArrayInitializer(t -> t <: Union{Tuple, NamedTuple, Pair})) to only "unwrap" these types and leave ComplexF64 alone.

This actually simplifies the collection code and makes it more general. In particular collect_structarray also allows to collect LazyRow which are a lazy way of iterating a StructArray (LazyRow(v, i) is an object such that LazyRow(v, i).re == v.re[i].

Collect benchmarks

I think the performance difference is not strictly with Columns but with the collect_columns machinery. I'm using this small benchmark script (on IndexedTables and StructArrays master):

using BenchmarkTools
using StructArrays, IndexedTables
N = 10000;
v1 = StructArray(a=rand(1:10, N), b=rand(1:10, N), c=rand(1:10, N)) ;
v2 = Columns(a=rand(1:10, N), b=rand(1:10, N), c=rand(1:10, N)) ;
f1(v) = StructArrays.collect_structarray(v[i] for i in eachindex(v))
f2(v) = collect_columns(v[i] for i in eachindex(v))
f3(v) = StructArrays.collect_structarray(StructArrays.LazyRow(v, i) for i in eachindex(v))

@btime f1(v1);
@btime f2(v1);
@btime f3(v1);
@btime f1(v2);
@btime f2(v2);

That gives:

julia> @btime f1(v1);
  30.013 μs (14 allocations: 234.80 KiB)

julia> @btime f2(v1);
  39.197 μs (33 allocations: 235.59 KiB)

julia> @btime f3(v1);
  31.663 μs (14 allocations: 234.80 KiB)

julia> @btime f1(v2);
  31.332 μs (14 allocations: 234.80 KiB)

julia> @btime f2(v2);
  40.534 μs (33 allocations: 235.59 KiB)

From which (if I'm not doing anything wrong) I would say that the container type (Columns versus StructVector) doesn't really matter but collect_structarray (both on eager or lazy rows) is a bit faster than collect_columns (though probably this doesn't matter so much if we are doing non-trivial operations).

A possible way forward would be to first port things to StructArrays (to avoid code duplication and have a more general / widely used array type on which to build on). Then this would allow to use both eager iteration or LazyRow iteration for table operations here and we should just decide what's the default. For example, should map(t -> t.SepalLength / t.SepalWidth, iris) iterate lazy rows so that one doesn't even need to read the other fields of the dataset (but t is this funny object rather than a NamedTuple), or should we iterate the whole NamedTuple?

@piever
Copy link
Collaborator Author

piever commented Dec 17, 2018

Just to expand a bit on the choice of a good set of constructors, I had a write-up from slack that I'm copypasting here:

In terms of constructors, Columns accepts many arguments and takes that to be a tuple, i.e. Columns(1:10, 1:10) would be the same as Columns((1:10, 1:10)) and would iterate tuples. Similarly Columns(x=1:10, y=1:10) is the same as Columns((x=1:10, y=1:10)) and iterates namedtuples.
I support all these methods at the moment except StructArray(args...) as that could conflict with StructArray(v) (which transforms an actual vector of structs, say [(1,2), (2,3)] into a StructArray). I think there are two ways forwards:

  1. Status quo: error on StructArray(args...) and support StructArray(v)
  2. Deprecate StructArray(v) in favor of convert(StructArray, v) and support StructArray(args...) (useful as some sort of zip in steroids, in that it creates an AbstractArray whose elements are tuples)

I'm unsure about this one and was curious if there were strong arguments one side of the other.
To further complicate things, at the moment I support the args... method if one specifies what the "struct" is via a type parameter, e.g. StructArray{ComplexF64}(rand(10), rand(10)) which I'd rather keep as it is quite handy (and it goes more harmoniously with 2 than with 1)

And it is probably the one thing about the StructArrays API where I'm still unsure and would like feedback. @shashi seemed to find option 2 less confusing and personally I'm OK with either option.

@JeffBezanson
Copy link
Contributor

Wow, thank you for the detailed write up. In general I'm fine with porting to StructArrays, but I have some comments.

  • StructArray overloads getproperty to be able to use s.re to get the array of real components

I don't really like this --- it's basically an implicit map/broadcast of the kind we don't do anymore. A correct expression would be something like map(x->x.re, s), but ideally with more convenient syntax of course. I think we need a better solution here.

  • so the various arrays are always stored as a NamedTuple

I think that's fine; it's slightly awkward that names have to be invented in the case of tuples (since tuples are the only "struct" that don't have field names) but it's not the end of the world.

  • given that StructArray can be higher dimensionsal, the arrays corresponding to the various fields of the eltype are called fieldarrays rather than columns

+1 to supporting more dimensions. I'm not sure columns is really such a bad name though. I wonder if there is a standard term for this in the struct-of-arrays world? Maybe "stripes"? We should investigate.

  • that they should be converted to Tuple or Pair if required (for backward compatibility)

It seems to me we should try to eliminate this. Specially supporting both NamedTuple and Tuple makes at least a bit of sense, since they represent objects with symbol and integer keys, respectively. But I don't know why Pair would be special. Probably best to just use NamedTuple always.

  • StructArray has less constructors that Columns. I don't really like the StructArray(args...) = StructArray(args) constructor because it gets confusing when there is only one column.

👍 totally agree. We should follow other array type constructors; anything else will just cause more pain in the future. So StructArray(x) should expect x to be an iterator that generates structs. To construct from existing arrays, that would mean something like StructArray(zip(array1, array2, ...)). To specify the type of struct, it would be e.g. StructArray(zipwith(StructType, a1, a2, ...)). We don't actually have zipwith, but Base.Generator happens to have that exact behavior. Those are not good names, but maybe StructArray could export a convenient function/type to use for that.
The constructor StructArray(name1=array1, name2=array2, ...) might be ok since it doesn't conflict with other array constructors.

StructArrays does not depend on PooledArrays and in my mind it shouldn't as the two concepts are orthogonal.

👍

StructArrays tries to work with any struct, but I think when collecting an iterable of structs into a StructArray there is a fundamental ambiguity as to whether to "unwrap" or not.

As long as the default is just to unwrap one level I'm fine with it --- we don't (yet) need anything else in IndexedTables/JuliaDB.

For example one can do collect_structarray(itr; initializer = ArrayInitializer(t -> t <: Union{Tuple, NamedTuple, Pair})) to only "unwrap" these types and leave ComplexF64 alone.

I'm indifferent to this for now, but I don't think this could be called the "right" abstraction for this. It seems like in general you want to specify a layout of which fields to unwrap, which has nothing to do with the types of those fields.

A possible way forward would be to first port things to StructArrays (to avoid code duplication and have a more general / widely used array type on which to build on). Then this would allow to use both eager iteration or LazyRow iteration for table operations here and we should just decide what's the default.

👍 Sounds good. I think it's quite likely LazyRow should be the default.

@piever
Copy link
Collaborator Author

piever commented Dec 17, 2018

StructArray overloads getproperty to be able to use s.re to get the array of real components

I don't really like this --- it's basically an implicit map/broadcast of the kind we don't do anymore. A correct expression would be something like map(x->x.re, s), but ideally with more convenient syntax of course. I think we need a better solution here.

I guess my matlab training is showing here. The autovectorization getproperty(s::StructArray, x::Symbol) = getproperty.(s, x) does sound "old Julia", but we may need a better syntax than map(x -> x.re, s) for such a common operation. Regardless, I think the c.columns code for Columns should be cleaned up anyway as it is bad practice to access directly private fields of a type defined in another package (I don't want to break IndexedTables if I decide to rename the field) and we can decide how to best deprecate the getproperty overload in StructArrays if/when a nicer syntax becomes available.

that they should be converted to Tuple or Pair if required (for backward compatibility)

It seems to me we should try to eliminate this. Specially supporting both NamedTuple and Tuple makes at least a bit of sense, since they represent objects with symbol and integer keys, respectively. But I don't know why Pair would be special. Probably best to just use NamedTuple always.

It may be easier to first get tests passing by cheating by having columns preserve the same behavior and then see how easy/hard it is to switch to the NamedTuple strategy, potentially in a new PR.

To construct from existing arrays, that would mean something like StructArray(zip(array1, array2, ...)). To specify the type of struct, it would be e.g. StructArray(zipwith(StructType, a1, a2, ...))

At the moment those two things are StructArray((array1, array2, ..)) (which doesn't conflict with anything) and StructArray{StructType}(a1, a2, ...) which is slightly awkward as you would maybe expect StructArray{StructType}(v) to basically convert the eltype of v rather than creating a StructArray with just one column that is v. I'm not thrilled about the StructArray(zipwith(StructType, a1, a2, ...)) syntax but am not able to propose a better alternative either. Maybe simply StructArray{StructType}((a1, a2, ...))? The rule could be that we also accept tuples or named tuples of vectors because they cannot interfere with the standard array constructors that never accept only one tuple or named tuple argument.

For example one can do collect_structarray(itr; initializer = ArrayInitializer(t -> t <: Union{Tuple, NamedTuple, Pair})) to only "unwrap" these types and leave ComplexF64 alone.

I'm indifferent to this for now, but I don't think this could be called the "right" abstraction for this. It seems like in general you want to specify a layout of which fields to unwrap, which has nothing to do with the types of those fields.

I'm not 100% sure what the right abstraction could be. On the plus side, the code is organized in such a way that the actual code to fill in the columns and the initializer are fully decoupled so it should be straightforward to change the API to define the unwrapping details (unwrapping is only relevant to initialize things, not for the actual collection).

@JeffBezanson
Copy link
Contributor

Regardless, I think the c.columns code for Columns should be cleaned up anyway as it is bad practice to access directly private fields of a type defined in another package

Totally agree. It would still be nice to have a way to get a container of the columns though --- one issue with current approach is that for indexing and iteration a StructArray acts like a container of structs, but for properties it acts like a container of arrays. We should have a function to explicitly obtain a container of columns.

Maybe simply StructArray{StructType}((a1, a2, ...))?
The rule could be that we also accept tuples or named tuples of vectors because they cannot interfere with the standard array constructors that never accept only one tuple or named tuple argument.

Yes, that's an improvement for now. It's quite possible that eventually Array((1,2,3)) will give [1,2,3] though.

@piever
Copy link
Collaborator Author

piever commented Dec 18, 2018

It would still be nice to have a way to get a container of the columns though --- one issue with current approach is that for indexing and iteration a StructArray acts like a container of structs, but for properties it acts like a container of arrays. We should have a function to explicitly obtain a container of columns.

Sounds reasonable, but then I think columns should return that (as it has been designed together with rows to flip between "container of rows" to "container of columns"). In practice, we could for now keep the status quo of columns returning a NamedTuple when eltype is NamedTuple, Tuple when eltype is Tuple and Pair when eltype is Pair and as soon as we have a clean idea of what the "columns container" is, we could generalize to that in the case of a general custom structure. Note that Tuple, NamedTuple and Pair sound like a random assortment of special cases, but they are actually the 3 structures that IndexedTables uses in combination StructArrays. We use Tuple for unnamed columns, NamedTuple for named columns and Pair to pass a pair of columns of which we assume the first one to be sorted, so for example there is a special method for table(c::Columns{<:Pair}) to keep columns(c).first as primary keys. This is quite useful as we use it to implement table operations that return an iterator sorted on some keys (for example, groupby).

fieldarrays instead would stay as the actual accessor function (meaning, just get the value of that field in a StructArray, which - as an implementation detail - is always a NamedTuple).

@piever piever changed the title Very WIP (DO NOT MERGE!): attempt to port to StructArrays Port to StructArrays Dec 18, 2018
end
elseif copy
cs = Base.copy(cs)
cs = copyto!(similar(cs), cs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why this change is necessary? Shouldn't copy work?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using copy would create a subtle bug. This code expects similar and copy to return things of the same value (for index and index_buffer for example). However this is in general false for arrays. copy(1:3) isa UnitRange whereas similar(1:3) isa Array. This used to work with Columns because Columns has no custom method for copy, so it would default to copymutable(x) = copyto!(similar(x), x) which would in turn make all columns mutable. copy for a StructArray is instead defined as column-wise copy (which I think is correct).

I think copymutable is the correct choice here in general as we want to be able to modify the indexing arrays as well as the buffer arrays, but I don't think copying a StructArray should in general make the field arrays mutable.

@JeffBezanson
Copy link
Contributor

I think columns should return that

👍

Thanks for the explanation. The Pair behavior does seem a little opaque to me; would be nice to clean that up at some point (not this PR of course). For example is it possible to return an IndexedTable instead? That also conveys the first column being sorted.

@piever
Copy link
Collaborator Author

piever commented Dec 19, 2018

The Pair behavior does seem a little opaque to me; would be nice to clean that up at some point (not this PR of course). For example is it possible to return an IndexedTable instead? That also conveys the first column being sorted.

I've opened an issue to discuss the question in general at JuliaData/Tables.jl#52. Returning a table is also possible, but I would like to somehow have the concept of an iterator where some of it is sorted. For example I'd like to do Iterators.filter(f, iter) and not lose the information that some fields were sorted. This should allow to compose data pipelines nicely where each operation would know the sorting of the previous one without the need to materialize into an IndexedTable. Maybe one could define pkeys for iterators (not sure what the default value should be but could be overloaded to return the iterator of primary keys if sorting is known) or even keydata that would iterate pairs of keys and non-keys. Then pkeys(iter::Base.Iterators.Filter) could be defined to use pkeys from the parent iterator.

I would love to have feedback on this issue but I'd recommend continuing the discussion at JuliaData/Tables.jl#52 as I think it's relevant in general for the tabular data ecosystem.

@joshday joshday merged commit 938452e into master Dec 20, 2018
@joshday joshday deleted the pv/structarray branch December 20, 2018 14:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants