diff --git a/docs/src/man/working_with_dataframes.md b/docs/src/man/working_with_dataframes.md index e65d0ab03..a74298c18 100755 --- a/docs/src/man/working_with_dataframes.md +++ b/docs/src/man/working_with_dataframes.md @@ -812,14 +812,21 @@ julia> df = DataFrame(A=1:4, B=4.0:-1.0:1.0) 3 │ 3 2.0 4 │ 4 1.0 -julia> combine(df, names(df) .=> sum) +julia> combine(df, All() .=> sum) 1×2 DataFrame Row │ A_sum B_sum │ Int64 Float64 ─────┼──────────────── 1 │ 10 10.0 -julia> combine(df, names(df) .=> sum, names(df) .=> prod) +julia> combine(df, All() .=> sum, All() .=> prod) +1×4 DataFrame + Row │ A_sum B_sum A_prod B_prod + │ Int64 Float64 Int64 Float64 +─────┼───────────────────────────────── + 1 │ 10 10.0 24 24.0 + +julia> combine(df, All() .=> [sum prod]) # the same using 2-dimensional broadcasting 1×4 DataFrame Row │ A_sum B_sum A_prod B_prod │ Int64 Float64 Int64 Float64 @@ -830,6 +837,90 @@ julia> combine(df, names(df) .=> sum, names(df) .=> prod) If you would prefer the result to have the same number of rows as the source data frame, use `select` instead of `combine`. +In the remainder of this section we will discuss more advanced topics related +to the operation specification syntax, so you may decide to skip them if you +want to focus on the most common usage patterns. + +A `DataFrame` can store values of any type as its columns, for example +below we show how one can store a `Tuple`: + +``` +julia> df2 = combine(df, All() .=> extrema) +1×2 DataFrame + Row │ A_extrema B_extrema + │ Tuple… Tuple… +─────┼─────────────────────── + 1 │ (1, 4) (1.0, 4.0) +``` + +Later you might want to expand the tuples into separate columns storing the computed +minima and maxima. This can be achieved by passing multiple columns for the output. +Here is an example of how this can be done by writing the column names by-hand for a single +input column: + +``` +julia> combine(df2, "A_extrema" => identity => ["A_min", "A_max"]) +1×2 DataFrame + Row │ A_min A_max + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 +``` + +You can extend it to handling all columns in `df2` using broadcasting: + +``` +julia> combine(df2, All() .=> identity .=> [["A_min", "A_max"], ["B_min", "B_max"]]) +1×4 DataFrame + Row │ A_min A_max B_min B_max + │ Int64 Int64 Float64 Float64 +─────┼──────────────────────────────── + 1 │ 1 4 1.0 4.0 +``` + +This approach works, but can be improved. Instead of writing all the column names +manually we can instead use a function as a way to specify target column names +based on source column names: + +``` +julia> combine(df2, All() .=> identity .=> c -> first(c) .* ["_min", "_max"]) +1×4 DataFrame + Row │ A_min A_max B_min B_max + │ Int64 Int64 Float64 Float64 +─────┼──────────────────────────────── + 1 │ 1 4 1.0 4.0 +``` + +Note that in this example we needed to pass `identity` explicitly since with +`All() => (c -> first(c) .* ["_min", "_max"])` the right-hand side part would be +treated as a transformation and not as a rule for target column names generation. + +You might want to perform the transformation of the source data frame into the result +we have just shown in one step. This can be achieved with the following expression: + +``` +julia> combine(df, All() .=> Ref∘extrema .=> c -> c .* ["_min", "_max"]) +1×4 DataFrame + Row │ A_min A_max B_min B_max + │ Int64 Int64 Float64 Float64 +─────┼──────────────────────────────── + 1 │ 1 4 1.0 4.0 +``` + +Note that in this case we needed to add a `Ref` call in the `Ref∘extrema` operation specification. +Without `Ref`, `combine` iterates the contents of the value returned by the operation specification function, +which in our case is a tuple of numbers, and tries to expand it assuming that each produced value represents one row, +so one gets an error: + +``` +julia> combine(df, All() .=> extrema .=> [c -> c .* ["_min", "_max"]]) +ERROR: ArgumentError: 'Tuple{Int64, Int64}' iterates 'Int64' values, +which doesn't satisfy the Tables.jl `AbstractRow` interface +``` + +Note that we used `Ref` as it is a container that is typically used in DataFrames.jl when one +wants to store one row, however, in general it could be another iterator (e.g. a tuple). + ## Handling of Columns Stored in a `DataFrame` Functions that transform a `DataFrame` to produce a