Commit

Tables Integration and transition to Union{T,Missing}
- Fallback table constructor now uses Tables
- Replaces DataValue with Union{T,Missing}, DataValueArray with Array{Union{T,Missing}}
- dropna -> dropmissing
- Added special selector Type, e.g. select(t, String)
joshday authored Dec 12, 2018
1 parent cf97826 commit a5e9c32
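
The move from `DataValue` to `Union{T,Missing}` described in the commit message can be illustrated with Base Julia alone. This is a minimal sketch, not code from the commit: the variable names are made up, and Base's `skipmissing` stands in for the package's `dropmissing`, whose exact behavior is not shown on this page.

```julia
# Base-only sketch; `col`, `present`, and the totals are illustrative names.
col = Union{Int,Missing}[1, missing, 3]   # a column that may hold missing values

# `missing` propagates through ordinary arithmetic ...
total_with_missing = sum(col)

# ... so missing entries are usually skipped or dropped before reducing.
present = collect(skipmissing(col))
total = sum(present)
```

A plain `Array{Union{T,Missing}}` needs no wrapper array type, which is presumably why `DataValueArray` could be replaced by ordinary arrays.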
Showing 23 changed files with 281 additions and 231 deletions.
6 changes: 6 additions & 0 deletions NEWS.md
@@ -12,3 +12,9 @@
- **(feature)** - `collect_columns` function to collect an iterator of tuples to `Columns` object. (#135)
- **(bugfix)** use `collect_columns` to implement `map`, `groupreduce` and `groupjoin` (#150) to not depend on type inference. Works in many more cases.
- **(feature)** - `view` works with logical indexes now (#134)


## v0.9.0

- **(breaking)** Switch from DataValues to Missing. Related: `dropna` has been changed to `dropmissing`.
- **(breaking)** Depend on OnlineStatsBase rather than OnlineStats.
61 changes: 30 additions & 31 deletions README.md
@@ -10,8 +10,28 @@ be used on its own for efficient in-memory data processing and analytics.

## Data Structures

- **The two table types in IndexedTables differ in how data is accessed.**
- **There is no performance difference between table types for operations such as selecting, filtering, and map/reduce.**
IndexedTables offers two data structures: `IndexedTable` and `NDSparse`.

- **Both types store data _in columns_**.
- **`IndexedTable` and `NDSparse` differ mainly in how data is accessed.**
- **Both types have equal performance for Table operations (`select`, `filter`, etc.).**


## Quickstart

```
using Pkg
Pkg.add("IndexedTables")
using IndexedTables
t = table((x = 1:100, y = randn(100)))
select(t, :x)
filter(row -> row.y > 0, t)
```

## `IndexedTable` vs. `NDSparse`

First let's create some data to work with.

@@ -22,18 +42,18 @@ city = vcat(fill("New York", 3), fill("Boston", 3))

dates = repeat(Date(2016,7,6):Day(1):Date(2016,7,8), 2)

values = [91, 89, 91, 95, 83, 76]
vals = [91, 89, 91, 95, 83, 76]
```

### Table
### IndexedTable

- Data is accessed as a Vector of NamedTuples.
- Sorted by primary key(s), `pkey`.
- (Optionally) Sorted by primary key(s), `pkey`.
- Data is accessed as a Vector of NamedTuples.

```julia
using IndexedTables

julia> t1 = table((city = city, dates = dates, values = values); pkey = [:city, :dates])
julia> t1 = table((city = city, dates = dates, values = vals); pkey = [:city, :dates])
Table with 6 rows, 3 columns:
city dates values
──────────────────────────────
@@ -46,18 +66,15 @@ city dates values

julia> t1[1]
(city = "Boston", dates = 2016-07-06, values = 95)

julia> first(t1)
(city = "Boston", dates = 2016-07-06, values = 95)
```

### NDSparse

- Data is accessed as an N-dimensional sparse array with arbitrary indexes.
- Sorted by index variables (first argument).
- Data is accessed as an N-dimensional sparse array with arbitrary indexes.

```julia
julia> t2 = ndsparse(@NT(city=city, dates=dates), @NT(value=values))
julia> t2 = ndsparse((city=city, dates=dates), (value=vals,))
2-d NDSparse with 6 values (1 field named tuples):
city dates │ value
───────────────────────┼──────
@@ -70,26 +87,8 @@ city dates │ value

julia> t2["Boston", Date(2016, 7, 6)]
(value = 95)

julia> first(t2)
(value = 95)
```

As with other multi-dimensional arrays, dimensions can be permuted to change the sort order:

```julia
julia> permutedims(t2, [2,1])
2-d NDSparse with 6 values (1 field named tuples):
dates city │ value
───────────────────────┼──────
2016-07-06  "Boston"   │ 95
2016-07-06  "New York" │ 91
2016-07-07  "Boston"   │ 83
2016-07-07  "New York" │ 89
2016-07-08  "Boston"   │ 76
2016-07-08  "New York" │ 91
```

## Get started

For more information, check out the [JuliaDB API Reference](http://juliadb.org/latest/api/datastructures.html).
For more information, check out the [JuliaDB Documentation](http://juliadb.org/latest/index.html).
2 changes: 1 addition & 1 deletion REQUIRE
@@ -5,4 +5,4 @@ WeakRefStrings 0.4.4
TableTraits 0.3.0
TableTraitsUtils 0.2.0
IteratorInterfaceExtensions 0.1.0
DataValues
Tables
19 changes: 11 additions & 8 deletions src/IndexedTables.jl
@@ -4,13 +4,15 @@ using PooledArrays, SparseArrays, Statistics, WeakRefStrings, TableTraits,
TableTraitsUtils, IteratorInterfaceExtensions

using OnlineStatsBase: OnlineStat, fit!
using DataValues: DataValues, DataValue, NA, isna, DataValueArray
import DataValues: dropna
import Tables

import Base:
show, eltype, length, getindex, setindex!, ndims, map, convert, keys, values,
==, broadcast, empty!, copy, similar, sum, merge, merge!, mapslices,
permutedims, sort, sort!, iterate, pairs
permutedims, sort, sort!, iterate, pairs, reduce, push!, size, permute!, issorted,
sortperm, summary, resize!, vcat, append!, copyto!, view, tail,
tuple_type_cons, tuple_type_head, tuple_type_tail, in, convert


#-----------------------------------------------------------------------# exports
export
@@ -20,20 +22,20 @@ export
AbstractNDSparse, All, ApplyColwise, Between, ColDict, Columns, IndexedTable,
Keys, NDSparse, NextTable, Not,
# functions
aggregate, aggregate!, aggregate_vec, antijoin, asofjoin, collect_columns, colnames,
column, columns, convertdim, dimlabels, dropna, flatten, flush!, groupby, groupjoin,
aggregate!, antijoin, asofjoin, collect_columns, colnames,
column, columns, convertdim, dimlabels, flatten, flush!, groupby, groupjoin,
groupreduce, innerjoin, insertafter!, insertbefore!, insertcol, insertcolafter,
insertcolbefore, leftgroupjoin, leftjoin, map_rows, naturalgroupjoin, naturaljoin,
ncols, ndsparse, outergroupjoin, outerjoin, pkeynames, pkeys, popcol, pushcol,
reducedim_vec, reindex, renamecol, rows, select, selectkeys, selectvalues, setcol,
stack, summarize, table, unstack, update!, where
stack, summarize, table, unstack, update!, where, dropmissing, dropna

const Tup = Union{Tuple,NamedTuple}
const DimName = Union{Int,Symbol}

include("utils.jl")
include("columns.jl")
include("table.jl")
include("indexedtable.jl")
include("ndsparse.jl")
include("collect.jl")

@@ -73,7 +75,8 @@ include("flatten.jl")
include("join.jl")
include("reshape.jl")

# TableTraits.jl integration
# TableTraits/Tables integration
include("tabletraits.jl")
include("tables.jl")

end # module
7 changes: 2 additions & 5 deletions src/collect.jl
@@ -1,8 +1,5 @@
_is_subtype(::Type{S}, ::Type{T}) where {S, T} = promote_type(S, T) == T

dataarrayof(::Type{<:DataValue{T}}, len) where {T} = DataValueArray{T,1}(len)
dataarrayof(::Type{T}, len) where {T} = Vector{T}(undef, len)

"""
collect_columns(itr)
@@ -166,7 +163,7 @@ function widencolumns(dest, i, el::S, ::Type{T}) where{S <: Tup, T<:Tup}
idx = findall(collect(!(s <: t) for (s, t) in zip(sp, tp)))
new = dest
for l in idx
newcol = dataarrayof(promote_type(sp[l], tp[l]), length(dest))
newcol = Vector{promote_type(sp[l], tp[l])}(undef, length(dest))
copyto!(newcol, 1, column(dest, l), 1, i-1)
new = setcol(new, l, newcol)
end
@@ -175,7 +172,7 @@ function widencolumns(dest, i, el::S, ::Type{T}) where{S <: Tup, T<:Tup}
end

function widencolumns(dest, i, el::S, ::Type{T}) where{S, T}
new = dataarrayof(promote_type(S, T), length(dest))
new = Vector{promote_type(S, T)}(undef, length(dest))
copyto!(new, 1, dest, 1, i-1)
new
end
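
The change above swaps `dataarrayof` for a plain `Vector` whose element type comes from `promote_type`. A minimal Base-only sketch of that widening step follows. `widen_column` is a hypothetical helper, not a function from the package, and unlike `widencolumns` in the diff it also stores the new element, to keep the example self-contained.

```julia
# Hypothetical helper: widen `dest` so that it can hold `el` at position `i`.
function widen_column(dest::AbstractVector{T}, i::Integer, el::S) where {T,S}
    # Allocate a vector whose element type can hold both old and new values.
    new = Vector{promote_type(S, T)}(undef, length(dest))
    # Keep the i-1 elements that were already collected.
    copyto!(new, 1, dest, 1, i - 1)
    # Store the element that did not fit the old element type.
    new[i] = el
    return new
end

col = [1, 2, 0]                              # third slot not yet filled
with_missing = widen_column(col, 3, missing) # element type becomes Union{Int,Missing}
with_float = widen_column(col, 3, 2.5)       # element type becomes Float64
```

Because `promote_type(Int, Missing)` is `Union{Int,Missing}`, the same code path handles missing values without a dedicated array type.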
36 changes: 8 additions & 28 deletions src/columns.jl
@@ -1,7 +1,3 @@
import Base:
push!, size, sort, sort!, permute!, issorted, sortperm,
summary, resize!, vcat, append!, copyto!, view

"""
Wrapper around a (named) tuple of Vectors that acts like a Vector of (named) tuples.
@@ -97,7 +93,6 @@ available selection options and syntax.
"""
function columns end

columns(c) = error("no columns defined for $(typeof(c))")
columns(c::Columns) = c.columns

# Array-like API
@@ -110,17 +105,14 @@ length(c::Columns{<:Pair, <:Pair}) = length(c.columns.first)
ndims(c::Columns) = 1

"""
`ncols(itr)`
ncols(itr)
Returns the number of columns in `itr`.
# Examples
ncols([1,2,3])
ncols(rows(([1,2,3],[4,5,6])))
ncols(table(([1,2,3],[4,5,6])))
ncols(table(@NT(x=[1,2,3],y=[4,5,6])))
ncols(ndsparse(d, [7,8,9]))
ncols([1,2,3]) == 1
ncols(rows(([1,2,3],[4,5,6]))) == 2
"""
function ncols end
ncols(c::Columns) = fieldcount(typeof(c.columns))
@@ -184,21 +176,7 @@ resize!(I::Columns, n::Int) = (foreach(c->resize!(c,n), I.columns); I)

_sizehint!(c::Columns, n::Integer) = (foreach(c->_sizehint!(c,n), c.columns); c)

function ==(x::Columns, y::Columns)
nc = length(x.columns)
length(y.columns) == nc || return false
fieldnames(eltype(x)) == fieldnames(eltype(y)) || return false
n = length(x)
length(y) == n || return false
for i in 1:nc
x.columns[i] == y.columns[i] || return false
end
return true
end

==(x::Columns{<:Pair}, y::Columns) = false
==(x::Columns, y::Columns{<:Pair}) = false
==(x::Columns{<:Pair}, y::Columns{<:Pair}) = (x.columns.first == y.columns.first) && (x.columns.second == y.columns.second)
==(x::Columns, y::Columns) = x.columns == y.columns

function _strip_pair(c::Columns{<:Pair})
f, s = map(columns, c.columns)
@@ -368,7 +346,7 @@ end
# map

"""
`map_rows(f, c...)`
map_rows(f, c...)
Transform collection `c` by applying `f` to each element. For multiple collection arguments, apply `f`
elementwise. Collect output as `Columns` if `f` returns
@@ -449,7 +427,7 @@ struct Between{T1 <: Union{Int, Symbol}, T2 <: Union{Int, Symbol}}
last::T2
end

const SpecialSelector = Union{Not, All, Keys, Between, Function, Regex}
const SpecialSelector = Union{Not, All, Keys, Between, Function, Regex, Type}

hascolumns(t, s) = true
hascolumns(t, s::Symbol) = s in colnames(t)
@@ -458,6 +436,7 @@ hascolumns(t, s::Tuple) = all(hascolumns(t, x) for x in s)
hascolumns(t, s::Not) = hascolumns(t, s.cols)
hascolumns(t, s::Between) = hascolumns(t, s.first) && hascolumns(t, s.last)
hascolumns(t, s::All) = all(hascolumns(t, x) for x in s.cols)
hascolumns(t, s::Type) = any(x -> eltype(x) <: s, columns(t))

lowerselection(t, s) = s
lowerselection(t, s::Union{Int, Symbol}) = colindex(t, s)
@@ -467,6 +446,7 @@ lowerselection(t, s::Keys) = lowerselection(t, IndexedTables.pkeyn
lowerselection(t, s::Between) = Tuple(colindex(t, s.first):colindex(t, s.last))
lowerselection(t, s::Function) = colindex(t, Tuple(filter(s, collect(colnames(t)))))
lowerselection(t, s::Regex) = lowerselection(t, x -> occursin(s, string(x)))
lowerselection(t, s::Type) = Tuple(findall(x -> eltype(x) <: s, columns(t)))

function lowerselection(t, s::All)
s.cols == () && return lowerselection(t, valuenames(t))
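
The new `Type` selector picks columns by element type. The idea can be sketched without the package, assuming the columns are available as a named tuple of vectors. `columns_of_type` and the sample data are hypothetical names used only for illustration; the package's own logic is the `lowerselection(t, s::Type)` method in the diff above.

```julia
# Hypothetical stand-in for a table's columns.
cols = (name = ["a", "b"], n = [1, 2], city = ["x", "y"])

# Return the positions of all columns whose element type is a subtype of T.
columns_of_type(cols, T::Type) =
    Tuple(i for (i, c) in enumerate(cols) if eltype(c) <: T)

string_cols = columns_of_type(cols, String)
number_cols = columns_of_type(cols, Number)
```

Matching with `<:` rather than `==` means an abstract type such as `Number` selects every numeric column at once.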
30 changes: 9 additions & 21 deletions src/table.jl → src/indexedtable.jl
@@ -1,5 +1,3 @@
import Base: setindex!, reduce

"""
A permutation
@@ -16,7 +14,7 @@ end
abstract type AbstractIndexedTable end

"""
A tabular data structure that extends [`Columns`](@ref). Create a `IndexedTable` with the
A tabular data structure that extends [`Columns`](@ref). Create an `IndexedTable` with the
[`table`](@ref) function.
"""
struct IndexedTable{C<:Columns} <: AbstractIndexedTable
@@ -51,7 +49,9 @@ Construct a table from a vector of tuples. See [`rows`](@ref) and [`Columns`](@r
Copy a Table or NDSparse to create a new table. The same primary keys as the input are used.
table(iter; kw...)
table(x; kw...)
Create an `IndexedTable` from any object `x` that follows the `Tables.jl` interface.
# Keyword Argument Options:
@@ -353,7 +353,7 @@ function sort!(t::IndexedTable, by...; kwargs...)
end

"""
excludecols(itr, cols)
excludecols(itr, cols) -> Tuple of Int
Names of all columns in `itr` except `cols`. `itr` can be any of
`Table`, `NDSparse`, `Columns`, or `AbstractVector`
@@ -369,22 +369,10 @@ Names of all columns in `itr` except `cols`. `itr` can be any of
excludecols(t, pkeynames(t))
excludecols([1,2,3], (1,))
"""
function excludecols(t, cols)
if cols isa SpecialSelector
return excludecols(t, lowerselection(t, cols))
end
if !isa(cols, Tuple)
return excludecols(t, (cols,))
end
ns = colnames(t)
mask = ones(Bool, length(ns))
for c in cols
i = colindex(t, c)
if i !== 0
mask[i] = false
end
end
((1:length(ns))[mask]...,)
excludecols(t, cols) = excludecols(t, (cols,))
excludecols(t, cols::SpecialSelector) = excludecols(t, lowerselection(t, cols))
function excludecols(t, cols::Tuple)
Tuple(setdiff(1:length(colnames(t)), map(x -> colindex(t, x), cols)))
end

"""
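
The rewritten `excludecols` replaces the Boolean mask with a set difference over column indices. A minimal sketch of that step, using only Base and plain integer indices, is below. `exclude_indices` is a hypothetical name, and the treatment of `0` as "column not found" is an assumption inferred from the `i !== 0` check in the removed code.

```julia
# Hypothetical helper: indices 1:ncols with the indices in `drop` removed.
exclude_indices(ncols::Integer, drop) = Tuple(setdiff(1:ncols, drop))

kept = exclude_indices(4, (2, 4))
kept_unknown = exclude_indices(3, (0, 2))   # 0 is not in 1:3, so it is ignored
```

Under that assumption, an unknown column contributes index `0`, which `setdiff` leaves out of the result just as the old mask-based loop skipped it.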
1 change: 0 additions & 1 deletion src/indexing.jl
@@ -19,7 +19,6 @@ _in(x, v::AbstractString) = x == v
_in(x, v::Symbol) = x === v
_in(x, v::Number) = isequal(x, v)

import Base: tail
# test whether row r is within product(idxs...)
@inline row_in(cs, r::Integer, idxs) = _row_in(cs[1], r, idxs[1], tail(cs), tail(idxs))
@inline _row_in(c1, r, i1, rI, ri) = _in(c1[r],i1) & _row_in(rI[1], r, ri[1], tail(rI), tail(ri))