From dc59622fee8a45d3eb106d3f648231d3b7b8ecdf Mon Sep 17 00:00:00 2001 From: Nathan Boyer <65452054+nathanrboyer@users.noreply.github.com> Date: Fri, 13 Dec 2024 06:46:54 -0500 Subject: [PATCH] Updated Basic Usage of Manipulation Functions (#3360) --- docs/src/man/basics.md | 2197 ++++++++++++++++++++++++++++++---------- 1 file changed, 1654 insertions(+), 543 deletions(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index d1962262b..03e5c5082 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -1109,7 +1109,7 @@ true If in indexing you select a subset of rows from a data frame the mutation is performed in place, i.e. writing to an existing vector. -Below setting values of column `:Job` in rows `1:3` to values `[2, 4, 6]`: +Below setting values of column `:Job` in rows `1:3` to values `[2, 3, 2]`: ```jldoctest dataframe julia> df1[1:3, :Job] = [2, 3, 2] @@ -1215,7 +1215,7 @@ DataFrameRow 2 │ 98 male 2 ``` -This operations updated the data stored in the `df1` data frame. +These operations updated the data stored in the `df1` data frame. In a similar fashion views can be used to update data stored in their parent data frame. Here are some examples: @@ -1599,604 +1599,1715 @@ julia> german[Not(5), r"S"] 984 rows omitted ``` -## Basic Usage of Transformation Functions +## Manipulation Functions -In DataFrames.jl we have five functions that we can be used to perform -transformations of columns of a data frame: +The seven functions below can be used to manipulate data frames +by applying operations to them. 
-- `combine`: creates a new data frame populated with columns that are results of - transformation applied to the source data frame columns, potentially combining - its rows; -- `select`: creates a new data frame that has the same number of rows as the - source data frame populated with columns that are results of transformations - applied to the source data frame columns; -- `select!`: the same as `select` but updates the passed data frame in place; -- `transform`: the same as `select` but keeps the columns that were already - present in the data frame (note though that these columns can be potentially - modified by the transformation passed to `transform`); -- `transform!`: the same as `transform` but updates the passed data frame in - place. +The functions without a `!` in their name +will create a new data frame based on the source data frame, +so you will probably want to store the new data frame to a new variable name, +e.g. `new_df = transform(source_df, operation)`. +The functions with a `!` at the end of their name +will modify an existing data frame in-place, +so there is typically no need to assign the result to a variable, +e.g. `transform!(source_df, operation)` instead of +`source_df = transform(source_df, operation)`. -The fundamental ways to specify a transformation are: +The number of columns and rows in the resultant data frame varies +depending on the manipulation function employed. -- `source_column => transformation => target_column_name`; In this scenario the - `source_column` is passed as an argument to `transformation` function and - stored in `target_column_name` column. -- `source_column => transformation`; In this scenario we apply the - transformation function to `source_column` and the target column names is - automatically generated. -- `source_column => target_column_name` renames the `source_column` to - `target_column_name`. 
-- `source_column` just keep the source column as is in the result without any - transformation; +| Function | Memory Usage | Column Retention | Row Retention | +| ------------ | -------------------------------- | --------------------------------------- | --------------------------------------------------- | +| `transform` | Creates a new data frame. | Retains original and resultant columns. | Retains same number of rows as original data frame. | +| `transform!` | Modifies an existing data frame. | Retains original and resultant columns. | Retains same number of rows as original data frame. | +| `select` | Creates a new data frame. | Retains only resultant columns. | Retains same number of rows as original data frame. | +| `select!` | Modifies an existing data frame. | Retains only resultant columns. | Retains same number of rows as original data frame. | +| `subset` | Creates a new data frame. | Retains original columns. | Retains only rows where condition is true. | +| `subset!` | Modifies an existing data frame. | Retains original columns. | Retains only rows where condition is true. | +| `combine` | Creates a new data frame. | Retains only resultant columns. | Retains only resultant rows. | -These rules are typically called transformation mini-language. +### Constructing Operations -Let us move to the examples of application of these rules +All of the functions above use the same syntax which is commonly +`manipulation_function(dataframe, operation)`. 
+The `operation` argument defines the +operation to be applied to the source `dataframe`, +and it can take any of the following common forms explained below: -```jldoctest dataframe -julia> using Statistics +`source_column_selector` +: selects source column(s) without manipulating or renaming them + + Examples: `:a`, `[:a, :b]`, `All()`, `Not(:a)` + +`source_column_selector => operation_function` +: passes source column(s) as arguments to a function +and automatically names the resulting column(s) + + Examples: `:a => sum`, `[:a, :b] => +`, `:a => ByRow(==(3))` + +`source_column_selector => operation_function => new_column_names` +: passes source column(s) as arguments to a function +and names the resulting column(s) `new_column_names` + + Examples: `:a => sum => :sum_of_a`, `[:a, :b] => (+) => :a_plus_b` + + *(Not available for `subset` or `subset!`)* + +`source_column_selector => new_column_names` +: renames a source column, +or splits a column containing collection elements into multiple new columns + + Examples: `:a => :new_a`, `:a_b => [:a, :b]`, `:nt => AsTable` + + (*Not available for `subset` or `subset!`*) + +The `=>` operator constructs a +[Pair](https://docs.julialang.org/en/v1/base/collections/#Core.Pair), +which is a type to link one object to another. +(Pairs are commonly used to create elements of a +[Dictionary](https://docs.julialang.org/en/v1/base/collections/#Dictionaries).) +In DataFrames.jl manipulation functions, +`Pair` arguments are used to define column `operations` to be performed. +The examples shown above will be explained in more detail later. + +*The manipulation functions also have methods for applying multiple operations. +See the later sections [Applying Multiple Operations per Manipulation](@ref) +and [Broadcasting Operation Pairs](@ref) for more information.* + +#### `source_column_selector` +Inside an `operation`, `source_column_selector` is usually a column name +or column index which identifies a data frame column. 
+ +`source_column_selector` may be used as the entire `operation` +with `select` or `select!` to isolate or reorder columns. + +```julia +julia> df = DataFrame(a = [1, 2, 3], b = [4, 5, 6], c = [7, 8, 9]) +3×3 DataFrame + Row │ a b c + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 7 + 2 │ 2 5 8 + 3 │ 3 6 9 + +julia> select(df, :b) +3×1 DataFrame + Row │ b + │ Int64 +─────┼─────── + 1 │ 4 + 2 │ 5 + 3 │ 6 + +julia> select(df, "b") +3×1 DataFrame + Row │ b + │ Int64 +─────┼─────── + 1 │ 4 + 2 │ 5 + 3 │ 6 + +julia> select(df, 2) +3×1 DataFrame + Row │ b + │ Int64 +─────┼─────── + 1 │ 4 + 2 │ 5 + 3 │ 6 +``` + +`source_column_selector` may also be used as the entire `operation` +with `subset` or `subset!` if the source column contains `Bool` values. + +```julia +julia> df = DataFrame( + name = ["Scott", "Jill", "Erica", "Jimmy"], + minor = [false, true, false, true], + ) +4×2 DataFrame + Row │ name minor + │ String Bool +─────┼─────────────── + 1 │ Scott false + 2 │ Jill true + 3 │ Erica false + 4 │ Jimmy true + +julia> subset(df, :minor) +2×2 DataFrame + Row │ name minor + │ String Bool +─────┼─────────────── + 1 │ Jill true + 2 │ Jimmy true +``` + +`source_column_selector` may instead be a collection of columns such as a vector, +a [regular expression](https://docs.julialang.org/en/v1/manual/strings/#Regular-Expressions), +a `Not`, `Between`, `All`, or `Cols` expression, +or a `:`. +See the [Indexing](@ref) API for the full list of possible values with references. + +!!! note + + The Julia parser sometimes prevents `:` from being used by itself. + If you get + `ERROR: syntax: whitespace not allowed after ":" used for quoting`, + try using `All()`, `Cols(:)`, or `(:)` instead to select all columns. 
+ +```julia +julia> df = DataFrame( + id = [1, 2, 3], + first_name = ["José", "Emma", "Nathan"], + last_name = ["Garcia", "Marino", "Boyer"], + age = [61, 24, 33] + ) +3×4 DataFrame + Row │ id first_name last_name age + │ Int64 String String Int64 +─────┼───────────────────────────────────── + 1 │ 1 José Garcia 61 + 2 │ 2 Emma Marino 24 + 3 │ 3 Nathan Boyer 33 + +julia> select(df, [:last_name, :first_name]) +3×2 DataFrame + Row │ last_name first_name + │ String String +─────┼─────────────────────── + 1 │ Garcia José + 2 │ Marino Emma + 3 │ Boyer Nathan + +julia> select(df, r"name") +3×2 DataFrame + Row │ first_name last_name + │ String String +─────┼─────────────────────── + 1 │ José Garcia + 2 │ Emma Marino + 3 │ Nathan Boyer + +julia> select(df, Not(:id)) +3×3 DataFrame + Row │ first_name last_name age + │ String String Int64 +─────┼────────────────────────────── + 1 │ José Garcia 61 + 2 │ Emma Marino 24 + 3 │ Nathan Boyer 33 + +julia> select(df, Between(2,4)) +3×3 DataFrame + Row │ first_name last_name age + │ String String Int64 +─────┼────────────────────────────── + 1 │ José Garcia 61 + 2 │ Emma Marino 24 + 3 │ Nathan Boyer 33 + +julia> df2 = DataFrame( + name = ["Scott", "Jill", "Erica", "Jimmy"], + minor = [false, true, false, true], + male = [true, false, false, true], + ) +4×3 DataFrame + Row │ name minor male + │ String Bool Bool +─────┼────────────────────── + 1 │ Scott false true + 2 │ Jill true false + 3 │ Erica false false + 4 │ Jimmy true true + +julia> subset(df2, [:minor, :male]) +1×3 DataFrame + Row │ name minor male + │ String Bool Bool +─────┼───────────────────── + 1 │ Jimmy true true +``` + +!!! note + + Using a `Symbol` in `source_column_selector` will perform slightly faster than using a string. + However, a string is convenient when column names contain spaces. + + All elements of `source_column_selector` must be the same type + (unless wrapped in `Cols`), + e.g.
`subset(df2, [:minor, "male"])` will error + since `Symbol` and string are used simultaneously. + +#### `operation_function` +Inside an `operation` pair, `operation_function` is a function +which operates on data frame columns passed as vectors. +When multiple columns are selected by `source_column_selector`, +the `operation_function` will receive the columns as separate positional arguments +in the order they were selected, e.g. `f(column1, column2, column3)`. + +```julia +julia> df = DataFrame(a = [1, 2, 3], b = [4, 5, 4]) +3×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 4 + +julia> combine(df, :a => sum) +1×1 DataFrame + Row │ a_sum + │ Int64 +─────┼─────── + 1 │ 6 + +julia> transform(df, :b => maximum) # `transform` and `select` copy scalar result to all rows +3×3 DataFrame + Row │ a b b_maximum + │ Int64 Int64 Int64 +─────┼───────────────────────── + 1 │ 1 4 5 + 2 │ 2 5 5 + 3 │ 3 4 5 + +julia> transform(df, [:b, :a] => -) # vector subtraction is okay +3×3 DataFrame + Row │ a b b_a_- + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 3 + 2 │ 2 5 3 + 3 │ 3 4 1 + +julia> transform(df, [:a, :b] => *) # vector multiplication is not defined +ERROR: MethodError: no method matching *(::Vector{Int64}, ::Vector{Int64}) +``` + +Don't worry! There is a quick fix for the previous error. +If you want to apply a function to each element in a column +instead of to the entire column vector, +then you can wrap your element-wise function in `ByRow` like +`ByRow(my_elementwise_function)`. +This will apply `my_elementwise_function` to every element in the column +and then collect the results back into a vector. 
+ +```julia +julia> transform(df, [:a, :b] => ByRow(*)) +3×3 DataFrame + Row │ a b a_b_* + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 4 + 2 │ 2 5 10 + 3 │ 3 4 12 + +julia> transform(df, Cols(:) => ByRow(max)) +3×3 DataFrame + Row │ a b a_b_max + │ Int64 Int64 Int64 +─────┼─────────────────────── + 1 │ 1 4 4 + 2 │ 2 5 5 + 3 │ 3 4 4 + +julia> f(x) = x + 1 +f (generic function with 1 method) + +julia> transform(df, :a => ByRow(f)) +3×3 DataFrame + Row │ a b a_f + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 2 + 2 │ 2 5 3 + 3 │ 3 4 4 +``` + +Alternatively, you may just want to define the function itself so it +[broadcasts](https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting) +over vectors. + +```julia +julia> g(x) = x .+ 1 +g (generic function with 1 method) + +julia> transform(df, :a => g) +3×3 DataFrame + Row │ a b a_g + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 2 + 2 │ 2 5 3 + 3 │ 3 4 4 + +julia> h(x, y) = x .+ y .+ 1 +h (generic function with 1 method) + +julia> transform(df, [:a, :b] => h) +3×3 DataFrame + Row │ a b a_b_h + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 6 + 2 │ 2 5 8 + 3 │ 3 4 8 +``` + +[Anonymous functions](https://docs.julialang.org/en/v1/manual/functions/#man-anonymous-functions) +are a convenient way to define and use an `operation_function` +all within the manipulation function call. 
+ +```julia +julia> select(df, :a => ByRow(x -> x + 1)) +3×1 DataFrame + Row │ a_function + │ Int64 +─────┼──────────── + 1 │ 2 + 2 │ 3 + 3 │ 4 + +julia> transform(df, [:a, :b] => ByRow((x, y) -> 2x + y)) +3×3 DataFrame + Row │ a b a_b_function + │ Int64 Int64 Int64 +─────┼──────────────────────────── + 1 │ 1 4 6 + 2 │ 2 5 9 + 3 │ 3 4 10 + +julia> subset(df, :b => ByRow(x -> x < 5)) +2×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 3 4 + +julia> subset(df, :b => ByRow(<(5))) # shorter version of the previous +2×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 3 4 +``` + +!!! note + + `operation_functions` within `subset` or `subset!` function calls + must return a Boolean vector. + `true` elements in the Boolean vector will determine + which rows are retained in the resulting data frame. + +As demonstrated above, `DataFrame` columns are usually passed +from `source_column_selector` to `operation_function` as one or more +vector arguments. +However, when `AsTable(source_column_selector)` is used, +the selected columns are collected and passed as a single `NamedTuple` +to `operation_function`. + +This is often useful when your `operation_function` is defined to operate +on a single collection argument rather than on multiple positional arguments. +The distinction is somewhat similar to the difference between the built-in +`min` and `minimum` functions. +`min` is defined to find the minimum value among multiple positional arguments, +while `minimum` is defined to find the minimum value +among the elements of a single collection argument. 
+ +```julia +julia> df = DataFrame(a = 1:2, b = 3:4, c = 5:6, d = 2:-1:1) +2×4 DataFrame + Row │ a b c d + │ Int64 Int64 Int64 Int64 +─────┼──────────────────────────── + 1 │ 1 3 5 2 + 2 │ 2 4 6 1 + +julia> select(df, Cols(:) => ByRow(min)) # `min` operates on multiple arguments +2×1 DataFrame + Row │ a_b_etc_min + │ Int64 +─────┼───────────── + 1 │ 1 + 2 │ 1 + +julia> select(df, AsTable(:) => ByRow(minimum)) # `minimum` operates on a collection +2×1 DataFrame + Row │ a_b_etc_minimum + │ Int64 +─────┼───────────────── + 1 │ 1 + 2 │ 1 + +julia> select(df, [:a,:b] => ByRow(+)) # `+` operates on multiple arguments +2×1 DataFrame + Row │ a_b_+ + │ Int64 +─────┼─────── + 1 │ 4 + 2 │ 6 + +julia> select(df, AsTable([:a,:b]) => ByRow(sum)) # `sum` operates on a collection +2×1 DataFrame + Row │ a_b_sum + │ Int64 +─────┼───────── + 1 │ 4 + 2 │ 6 + +julia> using Statistics # contains the `mean` function + +julia> select(df, AsTable(Between(:b, :d)) => ByRow(mean)) # `mean` operates on a collection +2×1 DataFrame + Row │ b_c_d_mean + │ Float64 +─────┼──────────── + 1 │ 3.33333 + 2 │ 3.66667 +``` + +`AsTable` can also be used to pass columns to a function which operates +on fields of a `NamedTuple`.
+ +```julia +julia> df = DataFrame(a = 1:2, b = 3:4, c = 5:6, d = 7:8) +2×4 DataFrame + Row │ a b c d + │ Int64 Int64 Int64 Int64 +─────┼──────────────────────────── + 1 │ 1 3 5 7 + 2 │ 2 4 6 8 + +julia> f(nt) = nt.a + nt.d +f (generic function with 1 method) + +julia> transform(df, AsTable(:) => ByRow(f)) +2×5 DataFrame + Row │ a b c d a_b_etc_f + │ Int64 Int64 Int64 Int64 Int64 +─────┼─────────────────────────────────────── + 1 │ 1 3 5 7 8 + 2 │ 2 4 6 8 10 +``` + +As demonstrated above, +in the `source_column_selector => operation_function` operation pair form, +the results of an operation will be placed into a new column with an +automatically-generated name based on the operation; +the new column name will be the `operation_function` name +appended to the source column name(s) with an underscore. + +This automatic column naming behavior can be avoided in two ways. +First, the operation result can be placed back into the original column +with the original column name by switching the keyword argument `renamecols` +from its default value (`true`) to `renamecols=false`. +This option prevents the function name from being appended to the column name +as it usually would be. + +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> transform(df, :a => ByRow(x->x+10), renamecols=false) # add 10 in-place +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 11 5 + 2 │ 12 6 + 3 │ 13 7 + 4 │ 14 8 +``` + +The second method to avoid the default manipulation column naming is to +specify your own `new_column_names`. + +#### `new_column_names` + +`new_column_names` can be included at the end of an `operation` pair to specify +the name of the new column(s). +`new_column_names` may be a symbol, string, function, vector of symbols, vector of strings, or `AsTable`. 
+ +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> transform(df, Cols(:) => ByRow(+) => :c) +4×3 DataFrame + Row │ a b c + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 6 + 2 │ 2 6 8 + 3 │ 3 7 10 + 4 │ 4 8 12 + +julia> transform(df, Cols(:) => ByRow(+) => "a+b") +4×3 DataFrame + Row │ a b a+b + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 6 + 2 │ 2 6 8 + 3 │ 3 7 10 + 4 │ 4 8 12 + +julia> transform(df, :a => ByRow(x->x+10) => "a+10") +4×3 DataFrame + Row │ a b a+10 + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 11 + 2 │ 2 6 12 + 3 │ 3 7 13 + 4 │ 4 8 14 +``` + +The `source_column_selector => new_column_names` operation form +can be used to rename columns without an intermediate function. +However, there are `rename` and `rename!` functions, +which accept similar syntax, +that tend to be more useful for this operation. + +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> transform(df, :a => :apple) # adds column `apple` +4×3 DataFrame + Row │ a b apple + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 1 + 2 │ 2 6 2 + 3 │ 3 7 3 + 4 │ 4 8 4 + +julia> select(df, :a => :apple) # retains only column `apple` +4×1 DataFrame + Row │ apple + │ Int64 +─────┼─────── + 1 │ 1 + 2 │ 2 + 3 │ 3 + 4 │ 4 + +julia> rename(df, :a => :apple) # renames column `a` to `apple` in-place +4×2 DataFrame + Row │ apple b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 +``` + +If `new_column_names` already exist in the source data frame, +those columns will be replaced in the existing column location +rather than being added to the end. +This can be done by manually specifying an existing column name +or by using the `renamecols=false` keyword argument. 
+ +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> transform(df, :b => (x -> x .+ 10)) # automatic new column and column name +4×3 DataFrame + Row │ a b b_function + │ Int64 Int64 Int64 +─────┼────────────────────────── + 1 │ 1 5 15 + 2 │ 2 6 16 + 3 │ 3 7 17 + 4 │ 4 8 18 + +julia> transform(df, :b => (x -> x .+ 10), renamecols=false) # transform column in-place +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 15 + 2 │ 2 16 + 3 │ 3 17 + 4 │ 4 18 + +julia> transform(df, :b => (x -> x .+ 10) => :a) # replace column :a +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 15 5 + 2 │ 16 6 + 3 │ 17 7 + 4 │ 18 8 +``` + +Actually, `renamecols=false` just prevents the function name from being appended to the final column name such that the operation is *usually* returned to the same column. + +```julia +julia> transform(df, [:a, :b] => +) # new column name is all source columns and function name +4×3 DataFrame + Row │ a b a_b_+ + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 6 + 2 │ 2 6 8 + 3 │ 3 7 10 + 4 │ 4 8 12 + +julia> transform(df, [:a, :b] => +, renamecols=false) # same as above but with no function name +4×3 DataFrame + Row │ a b a_b + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 6 + 2 │ 2 6 8 + 3 │ 3 7 10 + 4 │ 4 8 12 + +julia> transform(df, [:a, :b] => (+) => :a) # manually overwrite column :a (see Note below about parentheses) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 6 5 + 2 │ 8 6 + 3 │ 10 7 + 4 │ 12 8 +``` + +In the `source_column_selector => operation_function => new_column_names` operation form, +`new_column_names` may also be a renaming function which operates on a string +to create the destination column names programmatically. 
+ +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> add_prefix(s) = "new_" * s +add_prefix (generic function with 1 method) + +julia> transform(df, :a => (x -> 10 .* x) => add_prefix) # with named renaming function +4×3 DataFrame + Row │ a b new_a + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 10 + 2 │ 2 6 20 + 3 │ 3 7 30 + 4 │ 4 8 40 + +julia> transform(df, :a => (x -> 10 .* x) => (s -> "new_" * s)) # with anonymous renaming function +4×3 DataFrame + Row │ a b new_a + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 10 + 2 │ 2 6 20 + 3 │ 3 7 30 + 4 │ 4 8 40 +``` + +!!! note + + It is a good idea to wrap anonymous functions in parentheses + to avoid the `=>` operator accidentally becoming part of the anonymous function. + The examples above do not work correctly without the parentheses! + ```julia + julia> transform(df, :a => x -> 10 .* x => add_prefix) # Not what we wanted! + 4×3 DataFrame + Row │ a b a_function + │ Int64 Int64 Pair… + ─────┼──────────────────────────────────────────── + 1 │ 1 5 [10, 20, 30, 40]=>add_prefix + 2 │ 2 6 [10, 20, 30, 40]=>add_prefix + 3 │ 3 7 [10, 20, 30, 40]=>add_prefix + 4 │ 4 8 [10, 20, 30, 40]=>add_prefix + julia> transform(df, :a => x -> 10 .* x => s -> "new_" * s) # Not what we wanted! + 4×3 DataFrame + Row │ a b a_function + │ Int64 Int64 Pair… + ─────┼───────────────────────────────────── + 1 │ 1 5 [10, 20, 30, 40]=>#18 + 2 │ 2 6 [10, 20, 30, 40]=>#18 + 3 │ 3 7 [10, 20, 30, 40]=>#18 + 4 │ 4 8 [10, 20, 30, 40]=>#18 + ``` + +A renaming function will not work in the +`source_column_selector => new_column_names` operation form +because a function in the second element of the operation pair is assumed to take +the `source_column_selector => operation_function` operation form.
+To work around this limitation, use the +`source_column_selector => operation_function => new_column_names` operation form +with `identity` as the `operation_function`. + +```julia +julia> transform(df, :a => add_prefix) +ERROR: MethodError: no method matching *(::String, ::Vector{Int64}) + +julia> transform(df, :a => identity => add_prefix) +4×3 DataFrame + Row │ a b new_a + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 1 + 2 │ 2 6 2 + 3 │ 3 7 3 + 4 │ 4 8 4 +``` + +In this case though, +it is probably again more useful to use the `rename` or `rename!` function +rather than one of the manipulation functions +in order to rename in-place and avoid the intermediate `operation_function`. +```julia +julia> rename(add_prefix, df) # rename all columns with a function +4×2 DataFrame + Row │ new_a new_b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> rename(add_prefix, df; cols=:a) # rename some columns with a function +4×2 DataFrame + Row │ new_a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 +``` + +In the `source_column_selector => new_column_names` operation form, +only a single source column may be selected per operation, +so why is `new_column_names` plural? +It is possible to split the data contained inside a single column +into multiple new columns by supplying a vector of strings or symbols +as `new_column_names`. + +```julia +julia> df = DataFrame(data = [(1,2), (3,4)]) # vector of tuples +2×1 DataFrame + Row │ data + │ Tuple… +─────┼──────── + 1 │ (1, 2) + 2 │ (3, 4) + +julia> transform(df, :data => [:first, :second]) # manual naming +2×3 DataFrame + Row │ data first second + │ Tuple… Int64 Int64 +─────┼─────────────────────── + 1 │ (1, 2) 1 2 + 2 │ (3, 4) 3 4 +``` + +This kind of data splitting can even be done automatically with `AsTable`. 
+ +```julia +julia> transform(df, :data => AsTable) # default automatic naming with tuples +2×3 DataFrame + Row │ data x1 x2 + │ Tuple… Int64 Int64 +─────┼────────────────────── + 1 │ (1, 2) 1 2 + 2 │ (3, 4) 3 4 +``` + +If a data frame column contains `NamedTuple`s, +then `AsTable` will preserve the field names. +```julia +julia> df = DataFrame(data = [(a=1,b=2), (a=3,b=4)]) # vector of named tuples +2×1 DataFrame + Row │ data + │ NamedTup… +─────┼──────────────── + 1 │ (a = 1, b = 2) + 2 │ (a = 3, b = 4) + +julia> transform(df, :data => AsTable) # keeps names from named tuples +2×3 DataFrame + Row │ data a b + │ NamedTup… Int64 Int64 +─────┼────────────────────────────── + 1 │ (a = 1, b = 2) 1 2 + 2 │ (a = 3, b = 4) 3 4 +``` + +!!! note + + To pack multiple columns into a single column of `NamedTuple`s + (the reverse of the above operation), + apply the `identity` function wrapped in `ByRow`, e.g. + `transform(df, AsTable([:a, :b]) => ByRow(identity) => :data)`. + +Renaming functions also work for multi-column transformations, +but they must operate on a vector of strings.
+ +```julia +julia> df = DataFrame(data = [(1,2), (3,4)]) +2×1 DataFrame + Row │ data + │ Tuple… +─────┼──────── + 1 │ (1, 2) + 2 │ (3, 4) + +julia> new_names(v) = ["primary ", "secondary "] .* v +new_names (generic function with 1 method) + +julia> transform(df, :data => identity => new_names) +2×3 DataFrame + Row │ data primary data secondary data + │ Tuple… Int64 Int64 +─────┼────────────────────────────────────── + 1 │ (1, 2) 1 2 + 2 │ (3, 4) 3 4 +``` + +### Applying Multiple Operations per Manipulation +All data frame manipulation functions can accept multiple `operation` pairs +at once using any of the following methods: +- `manipulation_function(dataframe, operation1, operation2)` : multiple arguments +- `manipulation_function(dataframe, [operation1, operation2])` : vector argument +- `manipulation_function(dataframe, [operation1 operation2])` : matrix argument + +Passing multiple operations is especially useful for the `select`, `select!`, +and `combine` manipulation functions, +since they only retain columns which are a result of the passed operations. 
+ +```julia +julia> df = DataFrame(a = 1:4, b = [50,50,60,60], c = ["hat","bat","cat","dog"]) +4×3 DataFrame + Row │ a b c + │ Int64 Int64 String +─────┼────────────────────── + 1 │ 1 50 hat + 2 │ 2 50 bat + 3 │ 3 60 cat + 4 │ 4 60 dog + +julia> combine(df, :a => maximum, :b => sum, :c => join) # 3 combine operations +1×3 DataFrame + Row │ a_maximum b_sum c_join + │ Int64 Int64 String +─────┼──────────────────────────────── + 1 │ 4 220 hatbatcatdog + +julia> select(df, :c, :b, :a) # re-order columns +4×3 DataFrame + Row │ c b a + │ String Int64 Int64 +─────┼────────────────────── + 1 │ hat 50 1 + 2 │ bat 50 2 + 3 │ cat 60 3 + 4 │ dog 60 4 + +julia> select(df, :b, :) # `:` here means all other columns +4×3 DataFrame + Row │ b a c + │ Int64 Int64 String +─────┼────────────────────── + 1 │ 50 1 hat + 2 │ 50 2 bat + 3 │ 60 3 cat + 4 │ 60 4 dog + +julia> select( + df, + :c => (x -> "a " .* x) => :one_c, + :a => (x -> 100x), + :b, + renamecols=false + ) # can mix operation forms +4×3 DataFrame + Row │ one_c a b + │ String Int64 Int64 +─────┼────────────────────── + 1 │ a hat 100 50 + 2 │ a bat 200 50 + 3 │ a cat 300 60 + 4 │ a dog 400 60 + +julia> select( + df, + :c => ByRow(reverse), + :c => ByRow(uppercase) + ) # multiple operations on same column +4×2 DataFrame + Row │ c_reverse c_uppercase + │ String String +─────┼──────────────────────── + 1 │ tah HAT + 2 │ tab BAT + 3 │ tac CAT + 4 │ god DOG +``` + +In the last two examples, +the manipulation function arguments were split across multiple lines. +This is a good way to make manipulations with many operations more readable. + +Passing multiple operations to `subset` or `subset!` is an easy way to narrow in +on a particular row of data.
+ +```julia +julia> subset( + df, + :b => ByRow(==(60)), + :c => ByRow(contains("at")) + ) # rows with 60 and "at" +1×3 DataFrame + Row │ a b c + │ Int64 Int64 String +─────┼────────────────────── + 1 │ 3 60 cat +``` + +Note that all operations within a single manipulation must use the data +as it existed before the function call +i.e. you cannot use newly created columns for subsequent operations +within the same manipulation. + +```julia +julia> transform( + df, + [:a, :b] => ByRow(+) => :d, + :d => (x -> x ./ 2), + ) # requires two separate transformations +ERROR: ArgumentError: column name :d not found in the data frame; existing most similar names are: :a, :b and :c + +julia> new_df = transform(df, [:a, :b] => ByRow(+) => :d) +4×4 DataFrame + Row │ a b c d + │ Int64 Int64 String Int64 +─────┼───────────────────────────── + 1 │ 1 50 hat 51 + 2 │ 2 50 bat 52 + 3 │ 3 60 cat 63 + 4 │ 4 60 dog 64 + +julia> transform!(new_df, :d => (x -> x ./ 2) => :d_2) +4×5 DataFrame + Row │ a b c d d_2 + │ Int64 Int64 String Int64 Float64 +─────┼────────────────────────────────────── + 1 │ 1 50 hat 51 25.5 + 2 │ 2 50 bat 52 26.0 + 3 │ 3 60 cat 63 31.5 + 4 │ 4 60 dog 64 32.0 +``` + + +### Broadcasting Operation Pairs + +[Broadcasting](https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting) +pairs with `.=>` is often a convenient way to generate multiple +similar `operation`s to be applied within a single manipulation. +Broadcasting within the `Pair` of an `operation` is no different than +broadcasting in base Julia. +The broadcasting `.=>` will be expanded into a vector of pairs +(`[operation1, operation2, ...]`), +and this expansion will occur before the manipulation function is invoked. +Then the manipulation function will use the +`manipulation_function(dataframe, [operation1, operation2, ...])` method. +This process will be explained in more detail below. + +To illustrate these concepts, let us first examine the `Type` of a basic `Pair`. 
+In DataFrames.jl, a symbol, string, or integer +may be used to select a single column. +Some `Pair`s with these types are below. + +```julia +julia> typeof(:x => :a) +Pair{Symbol, Symbol} + +julia> typeof("x" => "a") +Pair{String, String} + +julia> typeof(1 => "a") +Pair{Int64, String} +``` + +Any of the `Pair`s above could be used to rename the first column +of the data frame below to `a`. + +```julia +julia> df = DataFrame(x = 1:3, y = 4:6) +3×2 DataFrame + Row │ x y + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 + +julia> select(df, :x => :a) +3×1 DataFrame + Row │ a + │ Int64 +─────┼─────── + 1 │ 1 + 2 │ 2 + 3 │ 3 + +julia> select(df, 1 => "a") +3×1 DataFrame + Row │ a + │ Int64 +─────┼─────── + 1 │ 1 + 2 │ 2 + 3 │ 3 +``` + +What should we do if we want to keep and rename both the `x` and `y` column? +One option is to supply a `Vector` of operation `Pair`s to `select`. +`select` will process all of these operations in order. + +```julia +julia> ["x" => "a", "y" => "b"] +2-element Vector{Pair{String, String}}: + "x" => "a" + "y" => "b" + +julia> select(df, ["x" => "a", "y" => "b"]) +3×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 +``` + +We can use broadcasting to simplify the syntax above. + +```julia +julia> ["x", "y"] .=> ["a", "b"] +2-element Vector{Pair{String, String}}: + "x" => "a" + "y" => "b" + +julia> select(df, ["x", "y"] .=> ["a", "b"]) +3×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 +``` + +Notice that `select` sees the same `Vector{Pair{String, String}}` operation +argument whether the individual pairs are written out explicitly or +constructed with broadcasting. +The broadcasting is applied before the call to `select`. + +```julia +julia> ["x" => "a", "y" => "b"] == (["x", "y"] .=> ["a", "b"]) +true +``` -julia> combine(german, :Age => mean => :mean_age) -1×1 DataFrame - Row │ mean_age - │ Float64 -─────┼────────── - 1 │ 35.546 +!!! 
note -julia> select(german, :Age => mean => :mean_age) -1000×1 DataFrame - Row │ mean_age - │ Float64 -──────┼────────── - 1 │ 35.546 - 2 │ 35.546 - 3 │ 35.546 - 4 │ 35.546 - 5 │ 35.546 - 6 │ 35.546 - 7 │ 35.546 - 8 │ 35.546 - ⋮ │ ⋮ - 994 │ 35.546 - 995 │ 35.546 - 996 │ 35.546 - 997 │ 35.546 - 998 │ 35.546 - 999 │ 35.546 - 1000 │ 35.546 - 985 rows omitted -``` - -As you can see in both cases the `mean` function was applied to `:Age` column -and the result was stored in the `:mean_age` column. The difference between -the `combine` and `select` functions is that the `combine` aggregates data -and produces as many rows as were returned by the transformation function. -On the other hand the `select` function always keeps the number of rows in a -data frame to be the same as in the source data frame. Therefore in this case -the result of the `mean` function got broadcasted. - -As `combine` potentially allows any number of rows to be produced as a result -of the transformation if we have a combination of transformations where some of -them produce a vector, and other produce scalars then scalars get broadcasted -exactly like in `select`. Here is an example: + These operation pairs (or vector of pairs) can be given variable names. + This is uncommon in practice but could be helpful for intermediate + inspection and testing. + ```julia + df = DataFrame(x = 1:3, y = 4:6) # create data frame + operation = ["x", "y"] .=> ["a", "b"] # save operation to variable + typeof(operation) # check type of operation + first(operation) # check first pair in operation + last(operation) # check last pair in operation + select(df, operation) # manipulate `df` with `operation` + ``` + +In Julia, +a non-vector broadcasted with a vector will be repeated in each resultant pair element. 
-```jldoctest dataframe -julia> combine(german, :Age => mean => :mean_age, :Housing => unique => :housing) +```julia +julia> ["x", "y"] .=> :a # :a is repeated +2-element Vector{Pair{String, Symbol}}: + "x" => :a + "y" => :a + +julia> 1 .=> [:a, :b] # 1 is repeated +2-element Vector{Pair{Int64, Symbol}}: + 1 => :a + 1 => :b +``` + +We can use this fact to easily broadcast an `operation_function` to multiple columns. + +```julia +julia> f(x) = 2 * x +f (generic function with 1 method) + +julia> ["x", "y"] .=> f # f is repeated +2-element Vector{Pair{String, typeof(f)}}: + "x" => f + "y" => f + +julia> select(df, ["x", "y"] .=> f) # apply f with automatic column renaming +3×2 DataFrame + Row │ x_f y_f + │ Int64 Int64 +─────┼────────────── + 1 │ 2 8 + 2 │ 4 10 + 3 │ 6 12 + +julia> ["x", "y"] .=> f .=> ["a", "b"] # f is repeated +2-element Vector{Pair{String, Pair{typeof(f), String}}}: + "x" => (f => "a") + "y" => (f => "b") + +julia> select(df, ["x", "y"] .=> f .=> ["a", "b"]) # apply f with manual column renaming 3×2 DataFrame - Row │ mean_age housing - │ Float64 String7 -─────┼─────────────────── - 1 │ 35.546 own - 2 │ 35.546 free - 3 │ 35.546 rent + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 2 8 + 2 │ 4 10 + 3 │ 6 12 ``` -Note, however, that it is not allowed to return vectors of different lengths in -different transformations: +A renaming function can be applied to multiple columns in the same way. +It will also be repeated in each operation `Pair`. 
-```jldoctest dataframe -julia> combine(german, :Age, :Housing => unique => :Housing) -ERROR: ArgumentError: New columns must have the same length as old columns +```julia +julia> newname(s::String) = s * "_new" +newname (generic function with 1 method) + +julia> ["x", "y"] .=> f .=> newname # both f and newname are repeated +2-element Vector{Pair{String, Pair{typeof(f), typeof(newname)}}}: + "x" => (f => newname) + "y" => (f => newname) + +julia> select(df, ["x", "y"] .=> f .=> newname) # apply f then rename column with newname +3×2 DataFrame + Row │ x_new y_new + │ Int64 Int64 +─────┼────────────── + 1 │ 2 8 + 2 │ 4 10 + 3 │ 6 12 ``` -Let us discuss some other examples using `select`. Often we want to apply some -function not to the whole column of a data frame, but rather to its individual -elements. Normally we can achieve this using broadcasting like this: +You can see from the type output above +that a three element pair does not actually exist. +A `Pair` (as the name implies) can only contain two elements. +Thus, `:x => :y => :z` becomes a nested `Pair`, +where `:x` is the first element and points to the `Pair` `:y => :z`, +which is the second element. -```jldoctest dataframe -julia> select(german, :Sex => (x -> uppercase.(x)) => :Sex) -1000×1 DataFrame - Row │ Sex - │ String -──────┼──────── - 1 │ MALE - 2 │ FEMALE - 3 │ MALE - 4 │ MALE - 5 │ MALE - 6 │ MALE - 7 │ MALE - 8 │ MALE - ⋮ │ ⋮ - 994 │ MALE - 995 │ MALE - 996 │ FEMALE - 997 │ MALE - 998 │ MALE - 999 │ MALE - 1000 │ MALE -985 rows omitted +```julia +julia> p = :x => :y => :z +:x => (:y => :z) + +julia> p[1] +:x + +julia> p[2] +:y => :z + +julia> p[2][1] +:y + +julia> p[2][2] +:z + +julia> p[3] # there is no index 3 for a pair +ERROR: BoundsError: attempt to access Pair{Symbol, Pair{Symbol, Symbol}} at index [3] ``` -This pattern is encountered very often in practice, therefore there is a `ByRow` -convenience wrapper for a function that creates its broadcasted variant. 
In -these examples `ByRow` is a special type used for selection operations to signal -that the wrapped function should be applied to each element (row) of the -selection. Here we are passing `ByRow` wrapper to target column name `:Sex` -using `uppercase` function: +In the previous examples, the source columns have been individually selected. +When broadcasting multiple columns to the same function, +often similarities in the column names or position can be exploited to avoid +tedious selection. +Consider a data frame with temperature data at three different locations +taken over time. +```julia +julia> df = DataFrame(Time = 1:4, + Temperature1 = [20, 23, 25, 28], + Temperature2 = [33, 37, 41, 44], + Temperature3 = [15, 10, 4, 0]) +4×4 DataFrame + Row │ Time Temperature1 Temperature2 Temperature3 + │ Int64 Int64 Int64 Int64 +─────┼───────────────────────────────────────────────── + 1 │ 1 20 33 15 + 2 │ 2 23 37 10 + 3 │ 3 25 41 4 + 4 │ 4 28 44 0 +``` + +To convert all of the temperature data in one transformation, +we just need to define a conversion function and broadcast +it to all of the "Temperature" columns. 
+ +```julia +julia> celsius_to_kelvin(x) = x + 273 +celsius_to_kelvin (generic function with 1 method) + +julia> transform( + df, + Cols(r"Temp") .=> ByRow(celsius_to_kelvin), + renamecols = false + ) +4×4 DataFrame + Row │ Time Temperature1 Temperature2 Temperature3 + │ Int64 Int64 Int64 Int64 +─────┼───────────────────────────────────────────────── + 1 │ 1 293 306 288 + 2 │ 2 296 310 283 + 3 │ 3 298 314 277 + 4 │ 4 301 317 273 +``` +Or, simultaneously changing the column names: -```jldoctest dataframe -julia> select(german, :Sex => ByRow(uppercase) => :SEX) -1000×1 DataFrame - Row │ SEX - │ String -──────┼──────── - 1 │ MALE - 2 │ FEMALE - 3 │ MALE - 4 │ MALE - 5 │ MALE - 6 │ MALE - 7 │ MALE - 8 │ MALE - ⋮ │ ⋮ - 994 │ MALE - 995 │ MALE - 996 │ FEMALE - 997 │ MALE - 998 │ MALE - 999 │ MALE - 1000 │ MALE -985 rows omitted +```julia +julia> rename_function(s) = "Temperature $(last(s)) (K)" +rename_function (generic function with 1 method) + +julia> select( + df, + "Time", + Cols(r"Temp") .=> ByRow(celsius_to_kelvin) .=> rename_function + ) +4×4 DataFrame + Row │ Time Temperature 1 (K) Temperature 2 (K) Temperature 3 (K) + │ Int64 Int64 Int64 Int64 +─────┼──────────────────────────────────────────────────────────────── + 1 │ 1 293 306 288 + 2 │ 2 296 310 283 + 3 │ 3 298 314 277 + 4 │ 4 301 317 273 ``` -In this case we transform our source column `:Age` using `ByRow` wrapper and -automatically generate the target column name: +!!! 
note "Notes" -```jldoctest dataframe -julia> select(german, :Age, :Age => ByRow(sqrt)) -1000×2 DataFrame - Row │ Age Age_sqrt - │ Int64 Float64 -──────┼───────────────── - 1 │ 67 8.18535 - 2 │ 22 4.69042 - 3 │ 49 7.0 - 4 │ 45 6.7082 - 5 │ 53 7.28011 - 6 │ 35 5.91608 - 7 │ 53 7.28011 - 8 │ 35 5.91608 - ⋮ │ ⋮ ⋮ - 994 │ 30 5.47723 - 995 │ 50 7.07107 - 996 │ 31 5.56776 - 997 │ 40 6.32456 - 998 │ 38 6.16441 - 999 │ 23 4.79583 - 1000 │ 27 5.19615 - 985 rows omitted + * `Not("Time")` or `2:4` would have been equally good choices for `source_column_selector` in the above operations. + * Don't forget `ByRow` if your function is to be applied to elements rather than entire column vectors. + Without `ByRow`, the manipulations above would have thrown + `ERROR: MethodError: no method matching +(::Vector{Int64}, ::Int64)`. + * Regular expression (`r""`) and `:` `source_column_selectors` + must be wrapped in `Cols` to be properly broadcasted + because otherwise the broadcasting occurs before the expression is expanded into a vector of matches. + +You could also broadcast different columns to different functions +by supplying a vector of functions. + +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> f1(x) = x .+ 1 +f1 (generic function with 1 method) + +julia> f2(x) = x ./ 10 +f2 (generic function with 1 method) + +julia> transform(df, [:a, :b] .=> [f1, f2]) +4×4 DataFrame + Row │ a b a_f1 b_f2 + │ Int64 Int64 Int64 Float64 +─────┼────────────────────────────── + 1 │ 1 5 2 0.5 + 2 │ 2 6 3 0.6 + 3 │ 3 7 4 0.7 + 4 │ 4 8 5 0.8 ``` -When we pass just a column (without the `=>` part) we can use any column selector -that is allowed in indexing. +However, this form is not much more convenient than supplying +multiple individual operations. 
-Here we exclude the column `:Age` from the resulting data frame: +```julia +julia> transform(df, [:a => f1, :b => f2]) # same manipulation as previous +4×4 DataFrame + Row │ a b a_f1 b_f2 + │ Int64 Int64 Int64 Float64 +─────┼────────────────────────────── + 1 │ 1 5 2 0.5 + 2 │ 2 6 3 0.6 + 3 │ 3 7 4 0.7 + 4 │ 4 8 5 0.8 +``` + +Perhaps more useful for broadcasting syntax +is to apply multiple functions to multiple columns +by changing the vector of functions to a 1-by-x matrix of functions. +(Recall that a list, a vector, or a matrix of operation pairs are all valid +for passing to the manipulation functions.) -```jldoctest dataframe -julia> select(german, Not(:Age)) -1000×9 DataFrame - Row │ id Sex Job Housing Saving accounts Checking account Cre ⋯ - │ Int64 String7 Int64 String7 String15 String15 Int ⋯ -──────┼───────────────────────────────────────────────────────────────────────── - 1 │ 0 male 2 own NA little ⋯ - 2 │ 1 female 2 own little moderate - 3 │ 2 male 1 own little NA - 4 │ 3 male 2 free little little - 5 │ 4 male 2 free little little ⋯ - 6 │ 5 male 1 free NA NA - 7 │ 6 male 2 own quite rich NA - 8 │ 7 male 3 rent little moderate - ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ - 994 │ 993 male 3 own little little ⋯ - 995 │ 994 male 2 own NA NA - 996 │ 995 female 1 own little NA - 997 │ 996 male 3 own little little - 998 │ 997 male 2 own little NA ⋯ - 999 │ 998 male 2 free little little - 1000 │ 999 male 2 own moderate moderate - 3 columns and 985 rows omitted +```julia +julia> [:a, :b] .=> [f1 f2] # No comma `,` between f1 and f2 +2×2 Matrix{Pair{Symbol}}: + :a=>f1 :a=>f2 + :b=>f1 :b=>f2 + +julia> transform(df, [:a, :b] .=> [f1 f2]) # No comma `,` between f1 and f2 +4×6 DataFrame + Row │ a b a_f1 b_f1 a_f2 b_f2 + │ Int64 Int64 Int64 Int64 Float64 Float64 +─────┼────────────────────────────────────────────── + 1 │ 1 5 2 6 0.1 0.5 + 2 │ 2 6 3 7 0.2 0.6 + 3 │ 3 7 4 8 0.3 0.7 + 4 │ 4 8 5 9 0.4 0.8 +``` + +In this way, every combination of selected columns and functions will be applied. 
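The expansion described above is ordinary Base Julia broadcasting, so it can be inspected without loading DataFrames.jl. The following is a minimal sketch under stated assumptions: `f1` and `f2` are stand-ins that mirror the functions defined earlier, and `cols`, `funs`, `ops`, `data`, and `preview` are hypothetical names used only for illustration. It checks that a column vector broadcast against a one-row matrix yields one pair per column/function combination.

```julia
# Stand-in functions mirroring `f1` and `f2` from the example above.
f1(x) = x .+ 1
f2(x) = x ./ 10

cols = [:a, :b]      # column vector of source column names (2 elements)
funs = [f1 f2]       # 1×2 matrix of functions (no comma between entries)

ops = cols .=> funs  # broadcasting expands this to a 2×2 matrix of pairs

# Apply each pair by hand to plain vectors to preview what each operation computes.
# `data` is a NamedTuple standing in for the columns of the data frame.
data = (a = [1, 2, 3, 4], b = [5, 6, 7, 8])
preview = [last(op)(data[first(op)]) for op in ops]
```

Each row of `ops` corresponds to one source column and each column of `ops` to one function, which is why the manipulation function receives every combination.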
+ +Pair broadcasting is a simple but powerful tool +that can be used in any of the manipulation functions listed under +[Manipulation Functions](@ref). +Experiment for yourself to discover other useful operations. + +### Additional Resources +More details and examples of operation pair syntax can be found in +[this blog post](https://bkamins.github.io/julialang/2020/12/24/minilanguage.html). +(The official wording describing the syntax has changed since the blog post was written, +but the examples are still illustrative. +The operation pair syntax is sometimes referred to as the DataFrames.jl mini-language +or Domain-Specific Language.) + +For additional syntax niceties, +many users find the [Chain.jl](https://github.com/jkrumbiegel/Chain.jl) +and [DataFramesMeta.jl](https://github.com/JuliaData/DataFramesMeta.jl) +packages useful +to help simplify manipulations that may be tedious with operation pairs alone. + +## Approach Comparison + +After that deep dive into [Manipulation Functions](@ref), +it is a good idea to review the alternative approaches covered in +[Getting and Setting Data in a Data Frame](@ref). +Let us compare the approaches with a few examples. + +For simple operations, +often getting/setting data with dot syntax +is simpler than the equivalent data frame manipulation. +Here we will add the two columns of our data frame together +and place the result in a new third column. + +**Setup:** + +```julia +julia> df = DataFrame(x = 1:3, y = 4:6) # define a data frame +3×2 DataFrame + Row │ x y + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 ``` -In the next example we drop columns `"Age"`, `"Saving accounts"`, -`"Checking account"`, `"Credit amount"`, and `"Purpose"`. 
Note that this time -we use string column selectors because some of the column names have spaces -in them: +**Manipulation:** -```jldoctest dataframe -julia> select(german, Not(["Age", "Saving accounts", "Checking account", - "Credit amount", "Purpose"])) -1000×5 DataFrame - Row │ id Sex Job Housing Duration - │ Int64 String7 Int64 String7 Int64 -──────┼────────────────────────────────────────── - 1 │ 0 male 2 own 6 - 2 │ 1 female 2 own 48 - 3 │ 2 male 1 own 12 - 4 │ 3 male 2 free 42 - 5 │ 4 male 2 free 24 - 6 │ 5 male 1 free 36 - 7 │ 6 male 2 own 24 - 8 │ 7 male 3 rent 36 - ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ - 994 │ 993 male 3 own 36 - 995 │ 994 male 2 own 12 - 996 │ 995 female 1 own 12 - 997 │ 996 male 3 own 30 - 998 │ 997 male 2 own 12 - 999 │ 998 male 2 free 45 - 1000 │ 999 male 2 own 45 - 985 rows omitted - -``` - -As another example let us present that the `r"S"` regular expression we used -above also works with `select`: +```julia +julia> transform!(df, [:x, :y] => (+) => :z) +3×3 DataFrame + Row │ x y z + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 5 + 2 │ 2 5 7 + 3 │ 3 6 9 +``` -```jldoctest dataframe -julia> select(german, r"S") -1000×2 DataFrame - Row │ Sex Saving accounts - │ String7 String15 -──────┼────────────────────────── - 1 │ male NA - 2 │ female little - 3 │ male little - 4 │ male little - 5 │ male little - 6 │ male NA - 7 │ male quite rich - 8 │ male little - ⋮ │ ⋮ ⋮ - 994 │ male little - 995 │ male NA - 996 │ female little - 997 │ male little - 998 │ male little - 999 │ male little - 1000 │ male moderate - 985 rows omitted -``` - -The benefit of `select` or `combine` over indexing is that it is easier -to get the union of several column selectors, e.g.: +**Dot Syntax:** -```jldoctest dataframe -julia> select(german, r"S", "Job", 1) -1000×4 DataFrame - Row │ Sex Saving accounts Job id - │ String7 String15 Int64 Int64 -──────┼──────────────────────────────────────── - 1 │ male NA 2 0 - 2 │ female little 2 1 - 3 │ male little 1 2 - 4 │ male little 2 3 
- 5 │ male little 2 4 - 6 │ male NA 1 5 - 7 │ male quite rich 2 6 - 8 │ male little 3 7 - ⋮ │ ⋮ ⋮ ⋮ ⋮ - 994 │ male little 3 993 - 995 │ male NA 2 994 - 996 │ female little 1 995 - 997 │ male little 3 996 - 998 │ male little 2 997 - 999 │ male little 2 998 - 1000 │ male moderate 2 999 - 985 rows omitted -``` - -Taking advantage of this flexibility here is an idiomatic pattern to move some -column to the front of a data frame: +```julia +julia> df.z = df.x + df.y +3-element Vector{Int64}: + 5 + 7 + 9 -```jldoctest dataframe -julia> select(german, "Sex", :) -1000×10 DataFrame - Row │ Sex id Age Job Housing Saving accounts Checking accou ⋯ - │ String7 Int64 Int64 Int64 String7 String15 String15 ⋯ -──────┼───────────────────────────────────────────────────────────────────────── - 1 │ male 0 67 2 own NA little ⋯ - 2 │ female 1 22 2 own little moderate - 3 │ male 2 49 1 own little NA - 4 │ male 3 45 2 free little little - 5 │ male 4 53 2 free little little ⋯ - 6 │ male 5 35 1 free NA NA - 7 │ male 6 53 2 own quite rich NA - 8 │ male 7 35 3 rent little moderate - ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ - 994 │ male 993 30 3 own little little ⋯ - 995 │ male 994 50 2 own NA NA - 996 │ female 995 31 1 own little NA - 997 │ male 996 40 3 own little little - 998 │ male 997 38 2 own little NA ⋯ - 999 │ male 998 23 2 free little little - 1000 │ male 999 27 2 own moderate moderate - 4 columns and 985 rows omitted +julia> df # see that the previous expression updated the data frame `df` +3×3 DataFrame + Row │ x y z + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 5 + 2 │ 2 5 7 + 3 │ 3 6 9 +``` + +Recall that the return type from a data frame manipulation function call is always a data frame. +The return type of a data frame column accessed with dot syntax is a `Vector`. +Thus the expression `df.x + df.y` gets the column data as vectors +and returns the result of the vector addition. +However, in that same line, +we assigned the resultant `Vector` to a new column `z` in the data frame `df`. 
+We could have instead assigned the resultant `Vector` to some other variable, +and then `df` would not have been altered. +The approach with dot syntax is very versatile +since the data getting, mathematics, and data setting can be separate steps. + +```julia +julia> df.x # dot syntax returns a vector +3-element Vector{Int64}: + 1 + 2 + 3 + +julia> v = df.x + df.y # assign mathematical result to a vector `v` +3-element Vector{Int64}: + 5 + 7 + 9 + +julia> df.z = v # place `v` into the data frame `df` with the column name `z` +3-element Vector{Int64}: + 5 + 7 + 9 ``` -Below, we are simply passing source column and target column name to rename them -(without specifying the transformation part): +However, one way in which dot syntax is less versatile +is that the column name must be explicitly written in the code. +Indexing syntax is a good alternative in these cases +which is only slightly longer to write than dot syntax. +Both indexing syntax and manipulation functions can operate on dynamic column names +stored in variables. -```jldoctest dataframe -julia> select(german, :Sex => :x1, :Age => :x2) -1000×2 DataFrame - Row │ x1 x2 - │ String7 Int64 -──────┼──────────────── - 1 │ male 67 - 2 │ female 22 - 3 │ male 49 - 4 │ male 45 - 5 │ male 53 - 6 │ male 35 - 7 │ male 53 - 8 │ male 35 - ⋮ │ ⋮ ⋮ - 994 │ male 30 - 995 │ male 50 - 996 │ female 31 - 997 │ male 40 - 998 │ male 38 - 999 │ male 23 - 1000 │ male 27 - 985 rows omitted +**Setup:** + +Imagine this setup data was read from a file and/or entered by a user at runtime. 
+ +```julia +julia> df = DataFrame("My First Column" => 1:3, "My Second Column" => 4:6) # define a data frame +3×2 DataFrame + Row │ My First Column My Second Column + │ Int64 Int64 +─────┼─────────────────────────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 + +julia> c1 = "My First Column"; c2 = "My Second Column"; c3 = "My Third Column"; # define column names ``` -It is important to note that `select` always returns a data frame, even if a -single column selected as opposed to indexing syntax. Compare the following: +**Dot Syntax:** -```jldoctest dataframe -julia> select(german, :Age) -1000×1 DataFrame - Row │ Age - │ Int64 -──────┼─────── - 1 │ 67 - 2 │ 22 - 3 │ 49 - 4 │ 45 - 5 │ 53 - 6 │ 35 - 7 │ 53 - 8 │ 35 - ⋮ │ ⋮ - 994 │ 30 - 995 │ 50 - 996 │ 31 - 997 │ 40 - 998 │ 38 - 999 │ 23 - 1000 │ 27 -985 rows omitted +```julia +julia> df.c1 # dot syntax expects an explicit column name and cannot be used to access variable column name +ERROR: ArgumentError: column name :c1 not found in the data frame +``` -julia> german[:, :Age] -1000-element Vector{Int64}: - 67 - 22 - 49 - 45 - 53 - 35 - 53 - 35 - 61 - 28 - ⋮ - 34 - 23 - 30 - 50 - 31 - 40 - 38 - 23 - 27 -``` - -By default `select` copies columns of a passed source data frame. 
In order to -avoid copying, pass the `copycols=false` keyword argument: +**Indexing:** -```jldoctest dataframe -julia> df = select(german, :Sex) -1000×1 DataFrame - Row │ Sex - │ String7 -──────┼───────── - 1 │ male - 2 │ female - 3 │ male - 4 │ male - 5 │ male - 6 │ male - 7 │ male - 8 │ male - ⋮ │ ⋮ - 994 │ male - 995 │ male - 996 │ female - 997 │ male - 998 │ male - 999 │ male - 1000 │ male -985 rows omitted +```julia +julia> df[:, c3] = df[:, c1] + df[:, c2] # access columns with names stored in variables +3-element Vector{Int64}: + 5 + 7 + 9 -julia> df.Sex === german.Sex # copy -false +julia> df # see that the previous expression updated the data frame `df` +3×3 DataFrame + Row │ My First Column My Second Column My Third Column + │ Int64 Int64 Int64 +─────┼──────────────────────────────────────────────────── + 1 │ 1 4 5 + 2 │ 2 5 7 + 3 │ 3 6 9 +``` -julia> df = select(german, :Sex, copycols=false) -1000×1 DataFrame - Row │ Sex - │ String7 -──────┼───────── - 1 │ male - 2 │ female - 3 │ male - 4 │ male - 5 │ male - 6 │ male - 7 │ male - 8 │ male - ⋮ │ ⋮ - 994 │ male - 995 │ male - 996 │ female - 997 │ male - 998 │ male - 999 │ male - 1000 │ male -985 rows omitted +**Manipulation:** -julia> df.Sex === german.Sex # no-copy is performed -true +```julia +julia> transform!(df, [c1, c2] => (+) => c3) # access columns with names stored in variables +3×3 DataFrame + Row │ My First Column My Second Column My Third Column + │ Int64 Int64 Int64 +─────┼──────────────────────────────────────────────────── + 1 │ 1 4 5 + 2 │ 2 5 7 + 3 │ 3 6 9 ``` -To perform the selection operation in-place use `select!`: +Additionally, manipulation functions only require +the name of the data frame to be written once. +This can be helpful when dealing with long variable and column names. 
-```jldoctest dataframe -julia> select!(german, Not(:Age)); +**Setup:** -julia> german -1000×9 DataFrame - Row │ id Sex Job Housing Saving accounts Checking account Cre ⋯ - │ Int64 String7 Int64 String7 String15 String15 Int ⋯ -──────┼───────────────────────────────────────────────────────────────────────── - 1 │ 0 male 2 own NA little ⋯ - 2 │ 1 female 2 own little moderate - 3 │ 2 male 1 own little NA - 4 │ 3 male 2 free little little - 5 │ 4 male 2 free little little ⋯ - 6 │ 5 male 1 free NA NA - 7 │ 6 male 2 own quite rich NA - 8 │ 7 male 3 rent little moderate - ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ - 994 │ 993 male 3 own little little ⋯ - 995 │ 994 male 2 own NA NA - 996 │ 995 female 1 own little NA - 997 │ 996 male 3 own little little - 998 │ 997 male 2 own little NA ⋯ - 999 │ 998 male 2 free little little - 1000 │ 999 male 2 own moderate moderate - 3 columns and 985 rows omitted +```julia +julia> my_very_long_data_frame_name = DataFrame( + "My First Column" => 1:3, + "My Second Column" => 4:6 + ) # define a data frame +3×2 DataFrame + Row │ My First Column My Second Column + │ Int64 Int64 +─────┼─────────────────────────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 + +julia> c1 = "My First Column"; c2 = "My Second Column"; c3 = "My Third Column"; # define column names ``` -As you can see the `:Age` column was dropped from the `german` data frame. +**Manipulation:** -The `transform` and `transform!` functions work identically to `select` and -`select!` with the only difference that they retain all columns that are present -in the source data frame. 
Here are some examples: +```julia -```jldoctest dataframe -julia> german = copy(german_ref); +julia> transform!(my_very_long_data_frame_name, [c1, c2] => (+) => c3) +3×3 DataFrame + Row │ My First Column My Second Column My Third Column + │ Int64 Int64 Int64 +─────┼──────────────────────────────────────────────────── + 1 │ 1 4 5 + 2 │ 2 5 7 + 3 │ 3 6 9 +``` -julia> df = german_ref[1:8, 1:5] -8×5 DataFrame - Row │ id Age Sex Job Housing - │ Int64 Int64 String7 Int64 String7 -─────┼─────────────────────────────────────── - 1 │ 0 67 male 2 own - 2 │ 1 22 female 2 own - 3 │ 2 49 male 1 own - 4 │ 3 45 male 2 free - 5 │ 4 53 male 2 free - 6 │ 5 35 male 1 free - 7 │ 6 53 male 2 own - 8 │ 7 35 male 3 rent - -julia> transform(df, :Age => maximum) -8×6 DataFrame - Row │ id Age Sex Job Housing Age_maximum - │ Int64 Int64 String7 Int64 String7 Int64 +**Indexing:** + +```julia +julia> my_very_long_data_frame_name[:, c3] = my_very_long_data_frame_name[:, c1] + my_very_long_data_frame_name[:, c2] +3-element Vector{Int64}: + 5 + 7 + 9 + +julia> my_very_long_data_frame_name # see that the previous expression updated the data frame +3×3 DataFrame + Row │ My First Column My Second Column My Third Column + │ Int64 Int64 Int64 ─────┼──────────────────────────────────────────────────── - 1 │ 0 67 male 2 own 67 - 2 │ 1 22 female 2 own 67 - 3 │ 2 49 male 1 own 67 - 4 │ 3 45 male 2 free 67 - 5 │ 4 53 male 2 free 67 - 6 │ 5 35 male 1 free 67 - 7 │ 6 53 male 2 own 67 - 8 │ 7 35 male 3 rent 67 + 1 │ 1 4 5 + 2 │ 2 5 7 + 3 │ 3 6 9 ``` -In the example below we are swapping values stored in columns `:Sex` and `:Age`: +Another benefit of manipulation functions and indexing over dot syntax is that +it is easier to operate on a subset of columns.
-```jldoctest dataframe -julia> transform(german, :Age => :Sex, :Sex => :Age) -1000×10 DataFrame - Row │ id Age Sex Job Housing Saving accounts Checking accou ⋯ - │ Int64 String7 Int64 Int64 String7 String15 String15 ⋯ -──────┼───────────────────────────────────────────────────────────────────────── - 1 │ 0 male 67 2 own NA little ⋯ - 2 │ 1 female 22 2 own little moderate - 3 │ 2 male 49 1 own little NA - 4 │ 3 male 45 2 free little little - 5 │ 4 male 53 2 free little little ⋯ - 6 │ 5 male 35 1 free NA NA - 7 │ 6 male 53 2 own quite rich NA - 8 │ 7 male 35 3 rent little moderate - ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ - 994 │ 993 male 30 3 own little little ⋯ - 995 │ 994 male 50 2 own NA NA - 996 │ 995 female 31 1 own little NA - 997 │ 996 male 40 3 own little little - 998 │ 997 male 38 2 own little NA ⋯ - 999 │ 998 male 23 2 free little little - 1000 │ 999 male 27 2 own moderate moderate - 4 columns and 985 rows omitted +**Setup:** + +```julia +julia> df = DataFrame(x = 1:3, y = 4:6, z = 7:9) # define data frame +3×3 DataFrame + Row │ x y z + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 7 + 2 │ 2 5 8 + 3 │ 3 6 9 ``` -If we give more than one source column to a transformation they are passed as -consecutive positional arguments. So for example the -`[:Age, :Job] => (+) => :res` transformation below evaluates `+(df1.Age, df1.Job)` -(which adds two columns) and stores the result in the `:res` column: +**Dot Syntax:** -```jldoctest dataframe -julia> select(german, :Age, :Job, [:Age, :Job] => (+) => :res) -1000×3 DataFrame - Row │ Age Job res - │ Int64 Int64 Int64 -──────┼───────────────────── - 1 │ 67 2 69 - 2 │ 22 2 24 - 3 │ 49 1 50 - 4 │ 45 2 47 - 5 │ 53 2 55 - 6 │ 35 1 36 - 7 │ 53 2 55 - 8 │ 35 3 38 - ⋮ │ ⋮ ⋮ ⋮ - 994 │ 30 3 33 - 995 │ 50 2 52 - 996 │ 31 1 32 - 997 │ 40 3 43 - 998 │ 38 2 40 - 999 │ 23 2 25 - 1000 │ 27 2 29 - 985 rows omitted -``` - -In the examples given in this introductory tutorial we did not cover all -options of the transformation mini-language. 
More advanced examples, in particular -showing how to pass or produce multiple columns using the `AsTable` operation -(which you might have seen in some DataFrames.jl demos) are given in the later -sections of the manual. +```julia +julia> df.Not(:x) # will not work; requires a literal column name +ERROR: ArgumentError: column name :Not not found in the data frame +``` + +**Indexing:** + +```julia +julia> df[:, :y_z_max] = maximum.(eachrow(df[:, Not(:x)])) # find the maximum value in each row, excluding column `x` +3-element Vector{Int64}: + 7 + 8 + 9 + +julia> df # see that the previous expression updated the data frame `df` +3×4 DataFrame + Row │ x y z y_z_max + │ Int64 Int64 Int64 Int64 +─────┼────────────────────────────── + 1 │ 1 4 7 7 + 2 │ 2 5 8 8 + 3 │ 3 6 9 9 +``` + +**Manipulation:** + +```julia +julia> transform!(df, Not(:x) => ByRow(max)) # starting from the setup `df`: find the maximum value in each row, excluding column `x` +3×4 DataFrame + Row │ x y z y_z_max + │ Int64 Int64 Int64 Int64 +─────┼────────────────────────────── + 1 │ 1 4 7 7 + 2 │ 2 5 8 8 + 3 │ 3 6 9 9 +``` + +Moreover, indexing can operate on a subset of columns *and* rows. + +**Indexing:** + +```julia +julia> y_z_max_row3 = maximum(df[3, Not(:x)]) # find the maximum value in row 3, excluding column `x` +9 +``` + +This small comparison illustrates some of the benefits and drawbacks +of the various syntaxes available in DataFrames.jl. +The best syntax to use depends on the situation.