Advanced transformation examples #3433

Merged
merged 5 commits into from
Sep 22, 2024
Changes from 4 commits
95 changes: 93 additions & 2 deletions docs/src/man/working_with_dataframes.md
@@ -812,14 +812,21 @@ julia> df = DataFrame(A=1:4, B=4.0:-1.0:1.0)
3 │ 3 2.0
4 │ 4 1.0

- julia> combine(df, names(df) .=> sum)
+ julia> combine(df, All() .=> sum)
1×2 DataFrame
Row │ A_sum B_sum
│ Int64 Float64
─────┼────────────────
1 │ 10 10.0

- julia> combine(df, names(df) .=> sum, names(df) .=> prod)
+ julia> combine(df, All() .=> sum, All() .=> prod)
1×4 DataFrame
Row │ A_sum B_sum A_prod B_prod
│ Int64 Float64 Int64 Float64
─────┼─────────────────────────────────
1 │ 10 10.0 24 24.0

julia> combine(df, All() .=> [sum prod]) # the same using 2-dimensional broadcasting
1×4 DataFrame
Row │ A_sum B_sum A_prod B_prod
│ Int64 Float64 Int64 Float64
@@ -830,6 +837,90 @@ julia> combine(df, names(df) .=> sum, names(df) .=> prod)
If you would prefer the result to have the same number of rows as the source
data frame, use `select` instead of `combine`.
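
As a minimal sketch of this difference, the snippet below runs the same operation specification through both functions on the `df` defined earlier. The variable names `agg` and `rep` are introduced here only for illustration.

```julia
using DataFrames

df = DataFrame(A=1:4, B=4.0:-1.0:1.0)

# combine collapses each column into a single aggregated row
agg = combine(df, All() .=> sum)

# select keeps one row per source row and repeats the scalar result in each
rep = select(df, All() .=> sum)

size(agg), size(rep)   # compare the shapes of the two results
```

With `select`, the column names are generated the same way as with `combine`; only the number of rows in the result differs.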

The remainder of this section covers more advanced topics related
to the operation specification syntax. You may skip it if you
want to focus on the most common usage patterns.

A `DataFrame` can store values of any type in its columns. For example,
below we show how one can store a `Tuple`:

```
julia> df2 = combine(df, All() .=> extrema)
1×2 DataFrame
Row │ A_extrema B_extrema
│ Tuple… Tuple…
─────┼───────────────────────
1 │ (1, 4) (1.0, 4.0)
```

Later you might want to expand the tuples into separate columns storing the computed
minima and maxima. This can be achieved by passing multiple column names for the output.
Here is an example of how this can be done by writing the column names by hand for a single
input column:

```
julia> combine(df2, "A_extrema" => identity => ["A_min", "A_max"])
1×2 DataFrame
Row │ A_min A_max
│ Int64 Int64
─────┼──────────────
1 │ 1 4
```

You can extend this to handle all columns in `df2` using broadcasting:

```
julia> combine(df2, All() .=> identity .=> [["A_min", "A_max"], ["B_min", "B_max"]])
1×4 DataFrame
Row │ A_min A_max B_min B_max
│ Int64 Int64 Float64 Float64
─────┼────────────────────────────────
1 │ 1 4 1.0 4.0
```

This approach works, but can be improved. Instead of writing all the column names
manually, we can use a function to specify target column names
based on source column names:

```
julia> combine(df2, All() .=> identity .=> c -> first(c) .* ["_min", "_max"])
1×4 DataFrame
Row │ A_min A_max B_min B_max
│ Int64 Int64 Float64 Float64
─────┼────────────────────────────────
1 │ 1 4 1.0 4.0
```

Note that in this example we needed to pass `identity` explicitly. With
`All() => (c -> first(c) .* ["_min", "_max"])` the right-hand side would be
treated as a transformation and not as a rule for generating target column names.
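
To see what such a naming function produces, it can be evaluated on its own in plain Julia. The sketch below assumes the function receives the source column name as a string, as the examples in this section suggest. The variable names are hypothetical and used only for illustration.

```julia
# Hypothetical stand-ins for source column names passed to the naming function.
long_name = "A_extrema"   # a column name as found in df2
short_name = "A"          # a column name as found in df

# first() takes the leading character, which drops the "_extrema" suffix.
# This relies on the original column names being one character long.
names1 = first(long_name) .* ["_min", "_max"]   # ["A_min", "A_max"]

# A string is treated as a scalar in broadcasting, so it is reused for each suffix.
names2 = short_name .* ["_min", "_max"]         # ["A_min", "A_max"]
```

For longer column names, a different rule than `first` would be needed to strip the suffix.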

You might want to transform the source data frame into the result
we have just shown in one step. This can be achieved with the following expression:

```
julia> combine(df, All() .=> Ref∘extrema .=> c -> c .* ["_min", "_max"])
1×4 DataFrame
Row │ A_min A_max B_min B_max
│ Int64 Int64 Float64 Float64
─────┼────────────────────────────────
1 │ 1 4 1.0 4.0
```

Note that in this case we needed to add a `Ref` call in the `Ref∘extrema` operation specification.
Without `Ref`, `combine` iterates the value returned by the operation specification function,
which in our case is a tuple of numbers, and tries to expand it assuming that each produced value represents one row.
As a result, one gets an error:

```
julia> combine(df, All() .=> extrema .=> [c -> c .* ["_min", "_max"]])
ERROR: ArgumentError: 'Tuple{Int64, Int64}' iterates 'Int64' values,
which doesn't satisfy the Tables.jl `AbstractRow` interface
```

Note that we used `Ref` because it is the container typically used in DataFrames.jl when one
wants to store one row. In general, however, it could be another iterator (e.g. a tuple).
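
The effect of the `Ref` wrapper can be inspected in plain Julia, without DataFrames.jl. This is only a sketch of the underlying values, not of how `combine` processes them internally. The names `wrapped` and `parts` are introduced here for illustration.

```julia
# Compose Ref with extrema, as in the one-step example above.
wrapped = (Ref∘extrema)(1:4)

# The Ref holds the whole tuple as a single value.
wrapped[]   # (1, 4)

# Iterating the bare tuple yields its numbers one by one. These are the
# values that would be taken as rows when no wrapper is used.
parts = [x for x in extrema(1:4)]   # [1, 4]
```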

## Handling of Columns Stored in a `DataFrame`

Functions that transform a `DataFrame` to produce a