From a2e49bcd40e83cd44420651ce268284d4041f208 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= <bkamins@sgh.waw.pl>
Date: Wed, 27 Mar 2024 20:34:46 +0100
Subject: [PATCH 1/4] advanced transformation examples

---
 docs/src/man/working_with_dataframes.md | 66 +++++++++++++++++++++++++
 1 file changed, 66 insertions(+)
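
Reviewer note on the `c -> first(c) .* ["_min", "_max"]` naming rule added here: it appears to rely on `first` returning the first character of the source column name, so `"A_extrema"` maps to `["A_min", "A_max"]` only because the prefix is a single character. The sketch below uses plain Base Julia and assumes the rule receives the source column name as a string; `strip_suffix_rule` is a hypothetical alternative for longer names, not something this patch adds.

```julia
# Naming rule as written in the docs example: first character plus a suffix.
name_rule = c -> first(c) .* ["_min", "_max"]
names_a = name_rule("A_extrema")

# With a longer column name only the first character survives.
short_price = name_rule("price_extrema")

# Hypothetical alternative (an assumption, not part of the patch):
# drop the "_extrema" suffix instead of keeping only the first character.
strip_suffix_rule = c -> replace(c, "_extrema" => "") .* ["_min", "_max"]
names_price = strip_suffix_rule("price_extrema")
```

For the single-letter columns used in the docs both rules give the same names, so the shorter `first(c)` form is a reasonable choice there.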

diff --git a/docs/src/man/working_with_dataframes.md b/docs/src/man/working_with_dataframes.md
index e65d0ab032..9f3f037bd8 100755
--- a/docs/src/man/working_with_dataframes.md
+++ b/docs/src/man/working_with_dataframes.md
@@ -830,6 +830,72 @@ julia> combine(df, names(df) .=> sum, names(df) .=> prod)
 If you would prefer the result to have the same number of rows as the source
 data frame, use `select` instead of `combine`.
 
+Note that a `DataFrame` can store values of any type as its columns, for example
+below we show how one can store a `Tuple`:
+
+```
+julia> df2 = combine(df, All() .=> extrema)
+1×2 DataFrame
+ Row │ A_extrema  B_extrema
+     │ Tuple…     Tuple…
+─────┼───────────────────────
+   1 │ (1, 4)     (1.0, 4.0)
+```
+
+Later you might want to expand the tuples into separate columns storing the computed
+minima and maxima. This can be achieved by passing multiple columns for the output.
+In the example below we show how this can be done in combination with a function
+so that we can generate target column names conditional on source column names:
+
+```
+julia> combine(df2, All() .=> identity .=> [c -> first(c) .* ["_min", "_max"]])
+1×4 DataFrame
+ Row │ A_min  A_max  B_min    B_max
+     │ Int64  Int64  Float64  Float64
+─────┼────────────────────────────────
+   1 │     1      4      1.0      4.0
+```
+
+Note that in this example we needed to pass `identity` explicitly as otherwise the
+functions generated with `c -> first(c) .* ["_min", "_max"]` would be treated as transformations
+and not as rules for target column names generation.
+
+You might want to perform the transformation of the source data frame into the result
+we have just shown in one step. This can be achieved with the following expression:
+
+```
+julia> combine(df, All() .=> Ref∘extrema .=> [c -> c .* ["_min", "_max"]])
+1×4 DataFrame
+ Row │ A_min  A_max  B_min    B_max
+     │ Int64  Int64  Float64  Float64
+─────┼────────────────────────────────
+   1 │     1      4      1.0      4.0
+```
+
+Note that in this case we needed to add a `Ref` call in the `Ref∘extrema` operation specification.
+The reason why this is needed is that instead `combine` iterates the contents of the value returned
+by the operation specification function and tries to expand it, which in our case is a tuple of numbers,
+so one gets an error:
+
+```
+julia> combine(df, names(df) .=> extrema .=> [c -> c .* ["_min", "_max"]])
+ERROR: ArgumentError: 'Tuple{Int64, Int64}' iterates 'Int64' values,
+which doesn't satisfy the Tables.jl `AbstractRow` interface
+```
+
+Note that we used `Ref` as it is a container that is typically used in DataFrames.jl when one
+wants to store one value, however, in general it could be another iterator. Here is an example
+when the tuple returned by `extrema` is wrapped in a `Tuple`, producing the same result:
+
+```
+julia> combine(df, names(df) .=> tuple∘extrema .=> [c -> c .* ["_min", "_max"]])
+1×4 DataFrame
+ Row │ A_min  A_max  B_min    B_max
+     │ Int64  Int64  Float64  Float64
+─────┼────────────────────────────────
+   1 │     1      4      1.0      4.0
+```
+
 ## Handling of Columns Stored in a `DataFrame`
 
 Functions that transform a `DataFrame` to produce a

From 8bdecf49c048bdac4642b5f8d51ed34b3bc3686b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= <bkamins@sgh.waw.pl>
Date: Sat, 13 Apr 2024 20:41:40 +0200
Subject: [PATCH 2/4] apply review suggestions

---
 docs/src/man/working_with_dataframes.md | 73 +++++++++++++++++--------
 1 file changed, 49 insertions(+), 24 deletions(-)
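
Reviewer note on the new `[sum prod]` example: the "2-dimensional broadcasting" comment can be checked in plain Base Julia. The sketch below uses a vector of column-name strings as a stand-in for `All()` (an assumption; how `All()` itself is broadcast is handled by DataFrames.jl and is not shown here). A column vector broadcast against a 1×2 row matrix gives a 2×2 matrix of pairs.

```julia
# Stand-in for the selected columns (assumption: plain strings instead of All()).
cols = ["A", "B"]

# `[sum prod]` (no comma) is a 1×2 matrix, so broadcasting `=>` over a
# 2-element vector and a 1×2 matrix produces a 2×2 matrix of pairs.
ops = cols .=> [sum prod]

# Column-major traversal lists both `sum` pairs first, then both `prod` pairs.
flat = vec(ops)
```

The column-major order of `flat` is consistent with the `A_sum  B_sum  A_prod  B_prod` column order in the printed result.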

diff --git a/docs/src/man/working_with_dataframes.md b/docs/src/man/working_with_dataframes.md
index 9f3f037bd8..ac13eeb6d5 100755
--- a/docs/src/man/working_with_dataframes.md
+++ b/docs/src/man/working_with_dataframes.md
@@ -812,14 +812,21 @@ julia> df = DataFrame(A=1:4, B=4.0:-1.0:1.0)
    3 │     3      2.0
    4 │     4      1.0
 
-julia> combine(df, names(df) .=> sum)
+julia> combine(df, All() .=> sum)
 1×2 DataFrame
  Row │ A_sum  B_sum
      │ Int64  Float64
 ─────┼────────────────
    1 │    10     10.0
 
-julia> combine(df, names(df) .=> sum, names(df) .=> prod)
+julia> combine(df, All() .=> sum, All() .=> prod)
+1×4 DataFrame
+ Row │ A_sum  B_sum    A_prod  B_prod
+     │ Int64  Float64  Int64   Float64
+─────┼─────────────────────────────────
+   1 │    10     10.0      24     24.0
+
+julia> combine(df, All() .=> [sum prod]) # the same using 2-dimensional broadcasting
 1×4 DataFrame
  Row │ A_sum  B_sum    A_prod  B_prod
      │ Int64  Float64  Int64   Float64
@@ -830,7 +837,11 @@ julia> combine(df, names(df) .=> sum, names(df) .=> prod)
 If you would prefer the result to have the same number of rows as the source
 data frame, use `select` instead of `combine`.
 
-Note that a `DataFrame` can store values of any type as its columns, for example
+In the remainder of this section we will discuss some of the more advanced topis
+related to operation specification syntax, so you may decide to skip them if you
+want to focus on the most common usage patterns.
+
+A `DataFrame` can store values of any type as its columns, for example
 below we show how one can store a `Tuple`:
 
 ```
@@ -844,11 +855,22 @@ julia> df2 = combine(df, All() .=> extrema)
 
 Later you might want to expand the tuples into separate columns storing the computed
 minima and maxima. This can be achieved by passing multiple columns for the output.
-In the example below we show how this can be done in combination with a function
-so that we can generate target column names conditional on source column names:
+Here is an example how this can be done by writing the column names by-hand for a single
+input column:
+
+```
+julia> combine(df2, "A_extrema" => identity => ["A_min", "A_max"])
+1×2 DataFrame
+ Row │ A_min  A_max
+     │ Int64  Int64
+─────┼──────────────
+   1 │     1      4
+```
+
+You can extend this to handle all columns in `df2` using broadcasting:
 
 ```
-julia> combine(df2, All() .=> identity .=> [c -> first(c) .* ["_min", "_max"]])
+julia> combine(df2, All() .=> identity .=> [["A_min", "A_max"], ["B_min", "B_max"]])
 1×4 DataFrame
  Row │ A_min  A_max  B_min    B_max
      │ Int64  Int64  Float64  Float64
@@ -856,15 +878,28 @@ julia> combine(df2, All() .=> identity .=> [c -> first(c) .* ["_min", "_max"]])
    1 │     1      4      1.0      4.0
 ```
 
-Note that in this example we needed to pass `identity` explicitly as otherwise the
-functions generated with `c -> first(c) .* ["_min", "_max"]` would be treated as transformations
-and not as rules for target column names generation.
+This approach works, but can be improved. Instead of writing all the column names
+manually we can instead use a function as a way to specify target column names
+conditional on source column names:
+
+```
+julia> combine(df2, All() .=> identity .=> c -> first(c) .* ["_min", "_max"])
+1×4 DataFrame
+ Row │ A_min  A_max  B_min    B_max
+     │ Int64  Int64  Float64  Float64
+─────┼────────────────────────────────
+   1 │     1      4      1.0      4.0
+```
+
+Note that in this example we needed to pass `identity` explicitly as with
+`All() =>  (c -> first(c) .* ["_min", "_max"])` the right-hand side part would be
+treated as a transformation and not as a rule for target column names generation.
 
 You might want to perform the transformation of the source data frame into the result
 we have just shown in one step. This can be achieved with the following expression:
 
 ```
-julia> combine(df, All() .=> Ref∘extrema .=> [c -> c .* ["_min", "_max"]])
+julia> combine(df, All() .=> Ref∘extrema .=> c -> c .* ["_min", "_max"])
 1×4 DataFrame
  Row │ A_min  A_max  B_min    B_max
      │ Int64  Int64  Float64  Float64
@@ -873,28 +908,18 @@ julia> combine(df, All() .=> Ref∘extrema .=> [c -> c .* ["_min", "_max"]])
 ```
 
 Note that in this case we needed to add a `Ref` call in the `Ref∘extrema` operation specification.
-The reason why this is needed is that instead `combine` iterates the contents of the value returned
-by the operation specification function and tries to expand it, which in our case is a tuple of numbers,
+Without `Ref`, `combine` iterates the contents of the value returned by the operation specification function,
+which in our case is a tuple of numbers, and tries to expand it assuming that each produced value specifies one row,
 so one gets an error:
 
 ```
-julia> combine(df, names(df) .=> extrema .=> [c -> c .* ["_min", "_max"]])
+julia> combine(df, All() .=> extrema .=> [c -> c .* ["_min", "_max"]])
 ERROR: ArgumentError: 'Tuple{Int64, Int64}' iterates 'Int64' values,
 which doesn't satisfy the Tables.jl `AbstractRow` interface
 ```
 
 Note that we used `Ref` as it is a container that is typically used in DataFrames.jl when one
-wants to store one value, however, in general it could be another iterator. Here is an example
-when the tuple returned by `extrema` is wrapped in a `Tuple`, producing the same result:
-
-```
-julia> combine(df, names(df) .=> tuple∘extrema .=> [c -> c .* ["_min", "_max"]])
-1×4 DataFrame
- Row │ A_min  A_max  B_min    B_max
-     │ Int64  Int64  Float64  Float64
-─────┼────────────────────────────────
-   1 │     1      4      1.0      4.0
-```
+wants to store one row; in general, however, it could be another iterator (e.g. a tuple).
 
 ## Handling of Columns Stored in a `DataFrame`
 

From c3f4d5e425f65cacd8866eeb4b9d332bc074bcae Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= <bkamins@sgh.waw.pl>
Date: Fri, 19 Apr 2024 22:51:47 +0200
Subject: [PATCH 3/4] Apply suggestions from code review

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
---
 docs/src/man/working_with_dataframes.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)
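
Reviewer note on the "each produced value represents one row" wording: the difference between `extrema` and `Ref∘extrema` can be seen without DataFrames.jl. This is a minimal Base Julia sketch of the values involved; how `combine` consumes them is as described in the patched text, not demonstrated here.

```julia
x = 1:4

# The bare result is a 2-tuple, so iterating it yields two numbers.
bare = extrema(x)
bare_items = collect(bare)

# Composing with `Ref` wraps the tuple in a one-element container,
# so iterating yields a single item: the whole tuple.
wrapped = (Ref∘extrema)(x)
inner = wrapped[]
```

Here `bare` iterates two `Int` values, while `wrapped` holds exactly one value, the tuple `(1, 4)`.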

diff --git a/docs/src/man/working_with_dataframes.md b/docs/src/man/working_with_dataframes.md
index ac13eeb6d5..fdd2b694eb 100755
--- a/docs/src/man/working_with_dataframes.md
+++ b/docs/src/man/working_with_dataframes.md
@@ -837,8 +837,8 @@ julia> combine(df, All() .=> [sum prod]) # the same using 2-dimensional broadcas
 If you would prefer the result to have the same number of rows as the source
 data frame, use `select` instead of `combine`.
 
-In the remainder of this section we will discuss some of the more advanced topis
-related to operation specification syntax, so you may decide to skip them if you
+In the remainder of this section we will discuss more advanced topics related
+to the operation specification syntax, so you may decide to skip them if you
 want to focus on the most common usage patterns.
 
 A `DataFrame` can store values of any type as its columns, for example
@@ -855,7 +855,7 @@ julia> df2 = combine(df, All() .=> extrema)
 
 Later you might want to expand the tuples into separate columns storing the computed
 minima and maxima. This can be achieved by passing multiple columns for the output.
-Here is an example how this can be done by writing the column names by-hand for a single
+Here is an example of how this can be done by writing the column names by hand for a single
 input column:
 
 ```
@@ -880,7 +880,7 @@ julia> combine(df2, All() .=> identity .=> [["A_min", "A_max"], ["B_min", "B_max
 
 This approach works, but can be improved. Instead of writing all the column names
 manually we can instead use a function as a way to specify target column names
-conditional on source column names:
+based on source column names:
 
 ```
 julia> combine(df2, All() .=> identity .=> c -> first(c) .* ["_min", "_max"])
@@ -909,7 +909,7 @@ julia> combine(df, All() .=> Ref∘extrema .=> c -> c .* ["_min", "_max"])
 
 Note that in this case we needed to add a `Ref` call in the `Ref∘extrema` operation specification.
 Without `Ref`, `combine` iterates the contents of the value returned by the operation specification function,
-which in our case is a tuple of numbers, and tries to expand it assuming that each produced value specifies one row,
+which in our case is a tuple of numbers, and tries to expand it assuming that each produced value represents one row,
 so one gets an error:
 
 ```

From e8a9212c67ea1647f2763537a28cd783bdfee4f0 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= <bkamins@sgh.waw.pl>
Date: Sun, 22 Sep 2024 22:10:53 +0200
Subject: [PATCH 4/4] Apply suggestions from code review

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
---
 docs/src/man/working_with_dataframes.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
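
Reviewer note on the `identity` sentence touched here: since `=>` is right-associative, the three-part and two-part forms are structurally different pairs, which may help readers see why the function lands in the transformation position when `identity` is omitted. The sketch is plain Base Julia; `rename` is just an illustrative name for the rule used in the docs.

```julia
# Illustrative name for the target-name rule from the docs example.
rename = c -> first(c) .* ["_min", "_max"]

# Three-part form: source => (transformation => target names).
three_part = "A_extrema" => identity => rename

# Two-part form: the function now sits directly after the source,
# which is the position of a transformation.
two_part = "A_extrema" => rename
```

In `three_part` the rule is the innermost second element, whereas in `two_part` it occupies the slot that `identity` held.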

diff --git a/docs/src/man/working_with_dataframes.md b/docs/src/man/working_with_dataframes.md
index fdd2b694eb..a74298c186 100755
--- a/docs/src/man/working_with_dataframes.md
+++ b/docs/src/man/working_with_dataframes.md
@@ -891,8 +891,8 @@ julia> combine(df2, All() .=> identity .=> c -> first(c) .* ["_min", "_max"])
    1 │     1      4      1.0      4.0
 ```
 
-Note that in this example we needed to pass `identity` explicitly as with
-`All() =>  (c -> first(c) .* ["_min", "_max"])` the right-hand side part would be
+Note that in this example we needed to pass `identity` explicitly since with
+`All() => (c -> first(c) .* ["_min", "_max"])` the right-hand side part would be
 treated as a transformation and not as a rule for target column names generation.
 
 You might want to perform the transformation of the source data frame into the result