Tables Integration and transition to Union{T,Missing}

- Fallback table constructor now uses Tables
- Replaces DataValue with Union{T,Missing}, DataValueArray with Array{Union{T,Missing}}
- dropna -> dropmissing
- Added special selector Type, e.g. select(t, String)
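
As a rough illustration of two of the changes described above (missing values stored as `Union{T,Missing}`, and column selection by element type), here is a Base-only Julia sketch. It does not call IndexedTables at all: `cols`, `keep`, `complete`, and `select_by_type` are hypothetical names used for illustration, and the package's own `dropmissing` and `select(t, String)` may differ in detail.

```julia
# Base-only sketch; no IndexedTables calls are used here.
# `cols` is a hypothetical stand-in for a table's named columns.
cols = (name  = ["a", "b", "c"],
        score = Union{Float64,Missing}[1.5, missing, 3.0],
        tag   = ["x", "y", "z"])

# Missing values live directly in an ordinary Vector{Union{T,Missing}},
# so no dedicated array type is needed to hold them.
nrows = length(cols.name)

# Row-wise "drop missing": keep only rows where no column holds `missing`.
keep = [!any(ismissing(col[i]) for col in cols) for i in 1:nrows]
complete = map(col -> col[keep], cols)

# Type-based selection: positions of the columns whose element type is a
# subtype of T (the same `eltype(x) <: s` test that appears in the diff).
select_by_type(cols, T::Type) =
    Tuple(i for (i, col) in enumerate(cols) if eltype(col) <: T)

strings = select_by_type(cols, String)
```

Note that a column with element type `Union{Float64,Missing}` does not match `Float64` under this test, since `Union{Float64,Missing} <: Float64` is false; selecting it by type would require passing the union type itself or a supertype.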
JuliaData · Dec 12, 2018 · a5e9c32
1 parent cf97826
commit a5e9c32
Showing 23 changed files with 281 additions and 231 deletions.
diff --git a/NEWS.md b/NEWS.md
@@ -12,3 +12,9 @@
 - **(feature)** - `collect_columns` function to collect an iterator of tuples to `Columns` object. (#135)
 - **(bugfix)** use `collect_columns` to implement `map`, `groupreduce` and `groupjoin` (#150) to not depend on type inference. Works in many more cases.
 - **(feature)** - `view` works with logical indexes now (#134)
+
+
+## v0.9.0
+
+- **(breaking)** Switch from DataValues to Missing.  Related: `dropna` has been changed to `dropmissing`.
+- **(breaking)** Depend on OnlineStatsBase rather than OnlineStats. 
diff --git a/README.md b/README.md
@@ -10,8 +10,28 @@ be used on its own for efficient in-memory data processing and analytics.
 
 ## Data Structures 
 
-- **The two table types in IndexedTables differ in how data is accessed.**
-- **There is no performance difference between table types for operations such as selecting, filtering, and map/reduce.**
+IndexedTables offers two data structures: `IndexedTable` and `NDSparse`.
+
+- **Both types store data _in columns_**.
+- **`IndexedTable` and `NDSparse` differ mainly in how data is accessed.**
+- **Both types have equal performance for Table operations (`select`, `filter`, etc.).** 
+
+
+## Quickstart
+
+```
+using Pkg
+Pkg.add("IndexedTables")
+using IndexedTables
+
+t = table((x = 1:100, y = randn(100)))
+
+select(t, :x)
+
+filter(row -> row.y > 0, t)
+```
+
+## `IndexedTable` vs. `NDSparse`
 
 First let's create some data to work with.
 
@@ -22,18 +42,18 @@ city = vcat(fill("New York", 3), fill("Boston", 3))
 
 dates = repeat(Date(2016,7,6):Day(1):Date(2016,7,8), 2)
 
-values = [91, 89, 91, 95, 83, 76]
+vals = [91, 89, 91, 95, 83, 76]
 ```
 
-### Table
+### IndexedTable
 
-- Data is accessed as a Vector of NamedTuples.  
-- Sorted by primary key(s), `pkey`.
+- (Optionally) Sorted by primary key(s), `pkey`.
+- Data is accessed as a Vector of NamedTuples.
 
 ```julia
 using IndexedTables
 
-julia> t1 = table((city = city, dates = dates, values = values); pkey = [:city, :dates])
+julia> t1 = table((city = city, dates = dates, values = vals); pkey = [:city, :dates])
 Table with 6 rows, 3 columns:
 city        dates       values
 ──────────────────────────────
@@ -46,18 +66,15 @@ city        dates       values
 
 julia> t1[1]
 (city = "Boston", dates = 2016-07-06, values = 95)
-
-julia> first(t1)
-(city = "Boston", dates = 2016-07-06, values = 95)
 ```
 
 ### NDSparse
 
-- Data is accessed as an N-dimensional sparse array with arbitrary indexes.
 - Sorted by index variables (first argument).
+- Data is accessed as an N-dimensional sparse array with arbitrary indexes.
 
 ```julia
-julia> t2 = ndsparse(@NT(city=city, dates=dates), @NT(value=values))
+julia> t2 = ndsparse((city=city, dates=dates), (value=vals,))
 2-d NDSparse with 6 values (1 field named tuples):
 city        dates      │ value
 ───────────────────────┼──────
@@ -70,26 +87,8 @@ city        dates      │ value
 
 julia> t2["Boston", Date(2016, 7, 6)]
 (value = 95)
-
-julia> first(t2)
-(value = 95)
-```
-
-As with other multi-dimensional arrays, dimensions can be permuted to change the sort order:
-
-```julia
-julia> permutedims(t2, [2,1])
-2-d NDSparse with 6 values (1 field named tuples):
-dates       city       │ value
-───────────────────────┼──────
-2016-07-06  "Boston"   │ 95
-2016-07-06  "New York" │ 91
-2016-07-07  "Boston"   │ 83
-2016-07-07  "New York" │ 89
-2016-07-08  "Boston"   │ 76
-2016-07-08  "New York" │ 91
 ```
 
 ## Get started
 
-For more information, check out the [JuliaDB API Reference](http://juliadb.org/latest/api/datastructures.html).
+For more information, check out the [JuliaDB Documentation](http://juliadb.org/latest/index.html).
diff --git a/REQUIRE b/REQUIRE
@@ -5,4 +5,4 @@ WeakRefStrings 0.4.4
 TableTraits 0.3.0
 TableTraitsUtils 0.2.0
 IteratorInterfaceExtensions 0.1.0
-DataValues
+Tables
diff --git a/src/IndexedTables.jl b/src/IndexedTables.jl
@@ -4,13 +4,15 @@ using PooledArrays, SparseArrays, Statistics, WeakRefStrings, TableTraits,
     TableTraitsUtils, IteratorInterfaceExtensions
 
 using OnlineStatsBase: OnlineStat, fit!
-using DataValues: DataValues, DataValue, NA, isna, DataValueArray
-import DataValues: dropna
+import Tables
 
 import Base:
     show, eltype, length, getindex, setindex!, ndims, map, convert, keys, values,
     ==, broadcast, empty!, copy, similar, sum, merge, merge!, mapslices,
-    permutedims, sort, sort!, iterate, pairs
+    permutedims, sort, sort!, iterate, pairs, reduce, push!, size, permute!, issorted, 
+    sortperm, summary, resize!, vcat, append!, copyto!, view, tail,
+    tuple_type_cons, tuple_type_head, tuple_type_tail, in, convert
+
 
 #-----------------------------------------------------------------------# exports
 export 
@@ -20,20 +22,20 @@ export
     AbstractNDSparse, All, ApplyColwise, Between, ColDict, Columns, IndexedTable,
     Keys, NDSparse, NextTable, Not,
     # functions
-    aggregate, aggregate!, aggregate_vec, antijoin, asofjoin, collect_columns, colnames,
-    column, columns, convertdim, dimlabels, dropna, flatten, flush!, groupby, groupjoin,
+    aggregate!, antijoin, asofjoin, collect_columns, colnames,
+    column, columns, convertdim, dimlabels, flatten, flush!, groupby, groupjoin,
     groupreduce, innerjoin, insertafter!, insertbefore!, insertcol, insertcolafter, 
     insertcolbefore, leftgroupjoin, leftjoin, map_rows, naturalgroupjoin, naturaljoin,
     ncols, ndsparse, outergroupjoin, outerjoin, pkeynames, pkeys, popcol, pushcol,
     reducedim_vec, reindex, renamecol, rows, select, selectkeys, selectvalues, setcol,
-    stack, summarize, table, unstack, update!, where
+    stack, summarize, table, unstack, update!, where, dropmissing, dropna
 
 const Tup = Union{Tuple,NamedTuple}
 const DimName = Union{Int,Symbol}
 
 include("utils.jl")
 include("columns.jl")
-include("table.jl")
+include("indexedtable.jl")
 include("ndsparse.jl")
 include("collect.jl")
 
@@ -73,7 +75,8 @@ include("flatten.jl")
 include("join.jl")
 include("reshape.jl")
 
-# TableTraits.jl integration
+# TableTraits/Tables integration
 include("tabletraits.jl")
+include("tables.jl")
 
 end # module
diff --git a/src/collect.jl b/src/collect.jl
@@ -1,8 +1,5 @@
 _is_subtype(::Type{S}, ::Type{T}) where {S, T} = promote_type(S, T) == T
 
-dataarrayof(::Type{<:DataValue{T}}, len) where {T} = DataValueArray{T,1}(len)
-dataarrayof(::Type{T}, len) where {T} = Vector{T}(undef, len)
-
 """
     collect_columns(itr)
 
@@ -166,7 +163,7 @@ function widencolumns(dest, i, el::S, ::Type{T}) where{S <: Tup, T<:Tup}
         idx = findall(collect(!(s <: t) for (s, t) in zip(sp, tp)))
         new = dest
         for l in idx
-            newcol = dataarrayof(promote_type(sp[l], tp[l]), length(dest))
+            newcol = Vector{promote_type(sp[l], tp[l])}(undef, length(dest))
             copyto!(newcol, 1, column(dest, l), 1, i-1)
             new = setcol(new, l, newcol)
         end
@@ -175,7 +172,7 @@ function widencolumns(dest, i, el::S, ::Type{T}) where{S <: Tup, T<:Tup}
 end
 
 function widencolumns(dest, i, el::S, ::Type{T}) where{S, T}
-    new = dataarrayof(promote_type(S, T), length(dest))
+    new = Vector{promote_type(S, T)}(undef, length(dest))
     copyto!(new, 1, dest, 1, i-1)
     new
 end

diff --git a/src/columns.jl b/src/columns.jl
@@ -1,7 +1,3 @@
-import Base:
-    push!, size, sort, sort!, permute!, issorted, sortperm,
-    summary, resize!, vcat, append!, copyto!, view
-
 """
 Wrapper around a (named) tuple of Vectors that acts like a Vector of (named) tuples.
 
@@ -97,7 +93,6 @@ available selection options and syntax.
 """
 function columns end
 
-columns(c) = error("no columns defined for $(typeof(c))")
 columns(c::Columns) = c.columns
 
 # Array-like API
@@ -110,17 +105,14 @@ length(c::Columns{<:Pair, <:Pair}) = length(c.columns.first)
 ndims(c::Columns) = 1
 
 """
-`ncols(itr)`
+    ncols(itr)
 
 Returns the number of columns in `itr`.
 
 # Examples
 
-    ncols([1,2,3])
-    ncols(rows(([1,2,3],[4,5,6])))
-    ncols(table(([1,2,3],[4,5,6])))
-    ncols(table(@NT(x=[1,2,3],y=[4,5,6])))
-    ncols(ndsparse(d, [7,8,9]))
+    ncols([1,2,3]) == 1
+    ncols(rows(([1,2,3],[4,5,6]))) == 2
 """
 function ncols end
 ncols(c::Columns) = fieldcount(typeof(c.columns))
@@ -184,21 +176,7 @@ resize!(I::Columns, n::Int) = (foreach(c->resize!(c,n), I.columns); I)
 
 _sizehint!(c::Columns, n::Integer) = (foreach(c->_sizehint!(c,n), c.columns); c)
 
-function ==(x::Columns, y::Columns)
-    nc = length(x.columns)
-    length(y.columns) == nc || return false
-    fieldnames(eltype(x)) == fieldnames(eltype(y)) || return false
-    n = length(x)
-    length(y) == n || return false
-    for i in 1:nc
-        x.columns[i] == y.columns[i] || return false
-    end
-    return true
-end
-
-==(x::Columns{<:Pair}, y::Columns) = false
-==(x::Columns, y::Columns{<:Pair}) = false
-==(x::Columns{<:Pair}, y::Columns{<:Pair}) = (x.columns.first == y.columns.first) && (x.columns.second == y.columns.second)
+==(x::Columns, y::Columns) = x.columns == y.columns
 
 function _strip_pair(c::Columns{<:Pair})
     f, s = map(columns, c.columns)
@@ -368,7 +346,7 @@ end
 # map
 
 """
-`map_rows(f, c...)`
+    map_rows(f, c...)
 
 Transform collection `c` by applying `f` to each element. For multiple collection arguments, apply `f`
 elementwise. Collect output as `Columns` if `f` returns
@@ -449,7 +427,7 @@ struct Between{T1 <: Union{Int, Symbol}, T2 <: Union{Int, Symbol}}
     last::T2
 end
 
-const SpecialSelector = Union{Not, All, Keys, Between, Function, Regex}
+const SpecialSelector = Union{Not, All, Keys, Between, Function, Regex, Type}
 
 hascolumns(t, s) = true
 hascolumns(t, s::Symbol) = s in colnames(t)
@@ -458,6 +436,7 @@ hascolumns(t, s::Tuple) = all(hascolumns(t, x) for x in s)
 hascolumns(t, s::Not) = hascolumns(t, s.cols)
 hascolumns(t, s::Between) = hascolumns(t, s.first) && hascolumns(t, s.last)
 hascolumns(t, s::All) = all(hascolumns(t, x) for x in s.cols)
+hascolumns(t, s::Type) = any(x -> eltype(x) <: s, columns(t))
 
 lowerselection(t, s)                     = s
 lowerselection(t, s::Union{Int, Symbol}) = colindex(t, s)
@@ -467,6 +446,7 @@ lowerselection(t, s::Keys)               = lowerselection(t, IndexedTables.pkeyn
 lowerselection(t, s::Between)            = Tuple(colindex(t, s.first):colindex(t, s.last))
 lowerselection(t, s::Function)           = colindex(t, Tuple(filter(s, collect(colnames(t)))))
 lowerselection(t, s::Regex)              = lowerselection(t, x -> occursin(s, string(x)))
+lowerselection(t, s::Type)               = Tuple(findall(x -> eltype(x) <: s, columns(t)))
 
 function lowerselection(t, s::All)
     s.cols == () && return lowerselection(t, valuenames(t))

diff --git a/src/table.jl b/src/indexedtable.jl
@@ -1,5 +1,3 @@
-import Base: setindex!, reduce
-
 """
 A permutation
 
@@ -16,7 +14,7 @@ end
 abstract type AbstractIndexedTable end
 
 """
-A tabular data structure that extends [`Columns`](@ref).  Create a `IndexedTable` with the 
+A tabular data structure that extends [`Columns`](@ref).  Create an `IndexedTable` with the 
 [`table`](@ref) function.
 """
 struct IndexedTable{C<:Columns} <: AbstractIndexedTable
@@ -51,7 +49,9 @@ Construct a table from a vector of tuples. See [`rows`](@ref) and [`Columns`](@r
 
 Copy a Table or NDSparse to create a new table. The same primary keys as the input are used.
 
-    table(iter; kw...)
+    table(x; kw...)
+
+Create an `IndexedTable` from any object `x` that follows the `Tables.jl` interface.
 
 
 # Keyword Argument Options:
@@ -353,7 +353,7 @@ function sort!(t::IndexedTable, by...; kwargs...)
 end
 
 """
-    excludecols(itr, cols)
+    excludecols(itr, cols) -> Tuple of Int
 
 Names of all columns in `itr` except `cols`. `itr` can be any of
 `Table`, `NDSparse`, `Columns`, or `AbstractVector`
@@ -369,22 +369,10 @@ Names of all columns in `itr` except `cols`. `itr` can be any of
     excludecols(t, pkeynames(t))
     excludecols([1,2,3], (1,))
 """
-function excludecols(t, cols)
-    if cols isa SpecialSelector
-        return excludecols(t, lowerselection(t, cols))
-    end
-    if !isa(cols, Tuple)
-        return excludecols(t, (cols,))
-    end
-    ns = colnames(t)
-    mask = ones(Bool, length(ns))
-    for c in cols
-        i = colindex(t, c)
-        if i !== 0
-            mask[i] = false
-        end
-    end
-    ((1:length(ns))[mask]...,)
+excludecols(t, cols) = excludecols(t, (cols,))
+excludecols(t, cols::SpecialSelector) = excludecols(t, lowerselection(t, cols))
+function excludecols(t, cols::Tuple) 
+    Tuple(setdiff(1:length(colnames(t)), map(x -> colindex(t, x), cols)))
 end
 
 """

diff --git a/src/indexing.jl b/src/indexing.jl
@@ -19,7 +19,6 @@ _in(x, v::AbstractString) = x == v
 _in(x, v::Symbol) = x === v
 _in(x, v::Number) = isequal(x, v)
 
-import Base: tail
 # test whether row r is within product(idxs...)
 @inline row_in(cs, r::Integer, idxs) = _row_in(cs[1], r, idxs[1], tail(cs), tail(idxs))
 @inline _row_in(c1, r, i1, rI, ri) = _in(c1[r],i1) & _row_in(rI[1], r, ri[1], tail(rI), tail(ri))