From 72d87d26d9391026baf8baaeb601fd857bf62fb9 Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Fri, 13 Oct 2023 17:04:13 -0400 Subject: [PATCH] Move back to basics.md and add comparison --- docs/make.jl | 1 - docs/src/index.md | 7 - docs/src/man/basics.md | 2108 ++++++++++++++++++------ docs/src/man/manipulation_functions.md | 1431 ---------------- 4 files changed, 1568 insertions(+), 1979 deletions(-) delete mode 100644 docs/src/man/manipulation_functions.md diff --git a/docs/make.jl b/docs/make.jl index d854981e2c..fa64782dac 100644 --- a/docs/make.jl +++ b/docs/make.jl @@ -34,7 +34,6 @@ makedocs( "Data manipulation frameworks" => "man/querying_frameworks.md", "Comparison with Python/R/Stata" => "man/comparisons.md" ], - "A Gentle Introduction to Data Frame Manipulation Functions" => "man/manipulation_functions.md", "API" => Any[ "Types" => "lib/types.md", "Functions" => "lib/functions.md", diff --git a/docs/src/index.md b/docs/src/index.md index e259fd7f13..66ed6f3e5f 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -229,13 +229,6 @@ Pages = ["man/basics.md", Depth = 2 ``` -## A Gentle Introduction to Data Frame Manipulation Functions - -```@contents -Pages = ["man/manipulation_functions.md"] -Depth = 1 -``` - ## API Only exported (i.e. 
available for use without `DataFrames.` qualifier after diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index 4e8ba02f75..55937b849b 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -1565,599 +1565,1627 @@ julia> german[Not(5), r"S"] 984 rows omitted ``` -## Basic Usage of Manipulation Functions - -In DataFrames.jl there are seven functions -which can be used to perform operations on data frame columns: - -- `combine`: creates a new data frame populated with columns that result from - operations applied to the source data frame columns, potentially combining - its rows; -- `select`: creates a new data frame that has the same number of rows as the - source data frame populated with columns that result from operations - applied to the source data frame columns; -- `select!`: the same as `select` but updates the passed data frame in place; -- `transform`: the same as `select` but keeps the columns that were already - present in the data frame (note though that these columns can be potentially - modified by the transformation passed to `transform`); -- `transform!`: the same as `transform` but updates the passed data frame in - place. -- `subset`: creates a new data frame populated with the same columns -as the source data frame, but with only the rows where the passed operations are true; -- `subset!`: the same as `subset` but updates the passed data frame in place; - -!!! Note Other Resources - * For formal, comprehensive explanations of all manipulation functions, - see the [Functions](@ref) API. - * For an informal, long-form tutorial on these functions, - see [A Gentle Introduction to Data Frame Manipulation Functions](@ref). - -Let us now move straight to examples using the German dataset. +## Manipulation Functions -```jldoctest dataframe -julia> using Statistics +The seven functions below can be used to manipulate data frames +by applying operations to them. 
+ +The functions without a `!` in their name +will create a new data frame based on the source data frame, +so you will probably want to store the new data frame to a new variable name, +e.g. `new_df = transform(source_df, operation)`. +The functions with a `!` at the end of their name +will modify an existing data frame in-place, +so there is typically no need to assign the result to a variable, +e.g. `transform!(source_df, operation)` instead of +`source_df = transform(source_df, operation)`. + +The number of columns and rows in the resultant data frame varies +depending on the manipulation function employed. + +| Function | Memory Usage | Column Retention | Row Retention | +| ------------ | -------------------------------- | --------------------------------------- | --------------------------------------------------- | +| `transform` | Creates a new data frame. | Retains original and resultant columns. | Retains same number of rows as original data frame. | +| `transform!` | Modifies an existing data frame. | Retains original and resultant columns. | Retains same number of rows as original data frame. | +| `select` | Creates a new data frame. | Retains only resultant columns. | Retains same number of rows as original data frame. | +| `select!` | Modifies an existing data frame. | Retains only resultant columns. | Retains same number of rows as original data frame. | +| `subset` | Creates a new data frame. | Retains original columns. | Retains only rows where condition is true. | +| `subset!` | Modifies an existing data frame. | Retains original columns. | Retains only rows where condition is true. | +| `combine` | Creates a new data frame. | Retains only resultant columns. | Retains only resultant rows. | + +### Constructing Operations + +All of the functions above use the same syntax which is commonly +`manipulation_function(dataframe, operation)`. 
+The `operation` argument defines the +operation to be applied to the source `dataframe`, +and it can take any of the common forms listed below: + +`source_column_selector` +: selects source column(s) without manipulating or renaming them + + Examples: `:a`, `[:a, :b]`, `All()`, `Not(:a)` + +`source_column_selector => operation_function` +: passes source column(s) as arguments to a function +and automatically names the resulting column(s) + + Examples: `:a => sum`, `[:a, :b] => +`, `:a => ByRow(==(3))` + +`source_column_selector => operation_function => new_column_names` +: passes source column(s) as arguments to a function +and names the resulting column(s) `new_column_names` + + Examples: `:a => sum => :sum_of_a`, `[:a, :b] => + => :a_plus_b` + + *(Not available for `subset` or `subset!`)* + +`source_column_selector => new_column_names` +: renames a source column, +or splits a column containing collection elements into multiple new columns + + Examples: `:a => :new_a`, `:a_b => [:a, :b]`, `:nt => AsTable` + + *(Not available for `subset` or `subset!`)* + +The `=>` operator constructs a +[Pair](https://docs.julialang.org/en/v1/base/collections/#Core.Pair), +a type that links one object to another. +(Pairs are commonly used to create elements of a +[Dictionary](https://docs.julialang.org/en/v1/base/collections/#Dictionaries).) +In DataFrames.jl manipulation functions, +`Pair` arguments define the column `operations` to be performed. +The examples shown above will be explained in more detail later. + +*The manipulation functions also have methods for applying multiple operations. +See the later sections [Applying Multiple Operations per Manipulation](@ref) +and [Broadcasting Operation Pairs](@ref) for more information.* + +#### `source_column_selector` +Inside an `operation`, `source_column_selector` is usually a column name +or column index which identifies a data frame column. 
+ +`source_column_selector` may be used as the entire `operation` +with `select` or `select!` to isolate or reorder columns. + +```julia +julia> df = DataFrame(a = [1, 2, 3], b = [4, 5, 6], c = [7, 8, 9]) +3×3 DataFrame + Row │ a b c + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 7 + 2 │ 2 5 8 + 3 │ 3 6 9 + +julia> select(df, :b) +3×1 DataFrame + Row │ b + │ Int64 +─────┼─────── + 1 │ 4 + 2 │ 5 + 3 │ 6 + +julia> select(df, "b") +3×1 DataFrame + Row │ b + │ Int64 +─────┼─────── + 1 │ 4 + 2 │ 5 + 3 │ 6 + +julia> select(df, 2) +3×1 DataFrame + Row │ b + │ Int64 +─────┼─────── + 1 │ 4 + 2 │ 5 + 3 │ 6 +``` + +`source_column_selector` may also be used as the entire `operation` +with `subset` or `subset!` if the source column contains `Bool` values. + +```julia +julia> df = DataFrame( + name = ["Scott", "Jill", "Erica", "Jimmy"], + minor = [false, true, false, true], + ) +4×2 DataFrame + Row │ name minor + │ String Bool +─────┼─────────────── + 1 │ Scott false + 2 │ Jill true + 3 │ Erica false + 4 │ Jimmy true + +julia> subset(df, :minor) +2×2 DataFrame + Row │ name minor + │ String Bool +─────┼─────────────── + 1 │ Jill true + 2 │ Jimmy true +``` + +`source_column_selector` may instead be a collection of columns such as a vector, +a [regular expression](https://docs.julialang.org/en/v1/manual/strings/#Regular-Expressions), +a `Not`, `Between`, `All`, or `Cols` expression, +or a `:`. +See the [Indexing](@ref) API for the full list of possible values with references. + +!!! Note + The Julia parser sometimes prevents `:` from being used by itself. + If you get + `ERROR: syntax: whitespace not allowed after ":" used for quoting`, + try using `All()`, `Cols(:)`, or `(:)` instead to select all columns. 
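The workaround from the note above can be sketched as follows. This is only an illustration: the data frame `df` and the column name `total` are made up, and the claim that a bare `:` directly before ` =>` triggers the parser error is an assumption based on the error message quoted in the note.

```julia
using DataFrames

# Hypothetical data frame used only for this illustration.
df = DataFrame(a = [1, 2], b = [3, 4])

# Three equivalent ways to select every column without a bare `:`.
all_1 = select(df, All())
all_2 = select(df, Cols(:))
all_3 = select(df, (:))

# The same kind of selector can be the source of an operation pair,
# which is where a bare `:` is most likely to confuse the parser.
with_total = transform(df, Cols(:) => ByRow(+) => :total)
```

All three `select` calls should return a data frame equal to `df`, and `with_total` should contain the extra column `total` holding the row-wise sums of `a` and `b`.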
-julia> combine(german, :Age => mean => :mean_age) +```julia +julia> df = DataFrame( + id = [1, 2, 3], + first_name = ["José", "Emma", "Nathan"], + last_name = ["Garcia", "Marino", "Boyer"], + age = [61, 24, 33] + ) +3×4 DataFrame + Row │ id first_name last_name age + │ Int64 String String Int64 +─────┼───────────────────────────────────── + 1 │ 1 José Garcia 61 + 2 │ 2 Emma Marino 24 + 3 │ 3 Nathan Boyer 33 + +julia> select(df, [:last_name, :first_name]) +3×2 DataFrame + Row │ last_name first_name + │ String String +─────┼─────────────────────── + 1 │ Garcia José + 2 │ Marino Emma + 3 │ Boyer Nathan + +julia> select(df, r"name") +3×2 DataFrame + Row │ first_name last_name + │ String String +─────┼─────────────────────── + 1 │ José Garcia + 2 │ Emma Marino + 3 │ Nathan Boyer + +julia> select(df, Not(:id)) +3×3 DataFrame + Row │ first_name last_name age + │ String String Int64 +─────┼────────────────────────────── + 1 │ José Garcia 61 + 2 │ Emma Marino 24 + 3 │ Nathan Boyer 33 + +julia> select(df, Between(2,4)) +3×3 DataFrame + Row │ first_name last_name age + │ String String Int64 +─────┼────────────────────────────── + 1 │ José Garcia 61 + 2 │ Emma Marino 24 + 3 │ Nathan Boyer 33 + +julia> df2 = DataFrame( + name = ["Scott", "Jill", "Erica", "Jimmy"], + minor = [false, true, false, true], + male = [true, false, false, true], + ) +4×3 DataFrame + Row │ name minor male + │ String Bool Bool +─────┼────────────────────── + 1 │ Scott false true + 2 │ Jill true false + 3 │ Erica false false + 4 │ Jimmy true true + +julia> subset(df2, [:minor, :male]) +1×3 DataFrame + Row │ name minor male + │ String Bool Bool +─────┼───────────────────── + 1 │ Jimmy true true +``` + +!!! Note + Using `Symbol` in `source_column_selector` will perform slightly faster than using `String`. + However, `String` is convenient when column names contain spaces. + + All elements of `source_column_selector` must be the same type + (unless wrapped in `Cols`), + e.g. 
`subset(df2, [:minor, "male"])` will error + since `Symbol` and `String` are used simultaneously. + +#### `operation_function` +Inside an `operation` pair, `operation_function` is a function +which operates on data frame columns passed as vectors. +When multiple columns are selected by `source_column_selector`, +the `operation_function` will receive the columns as separate positional arguments +in the order they were selected, e.g. `f(column1, column2, column3)`. + +```julia +julia> df = DataFrame(a = [1, 2, 3], b = [4, 5, 4]) +3×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 4 + +julia> combine(df, :a => sum) 1×1 DataFrame - Row │ mean_age + Row │ a_sum + │ Int64 +─────┼─────── + 1 │ 6 + +julia> transform(df, :b => maximum) # `transform` and `select` copy scalar result to all rows +3×3 DataFrame + Row │ a b b_maximum + │ Int64 Int64 Int64 +─────┼───────────────────────── + 1 │ 1 4 5 + 2 │ 2 5 5 + 3 │ 3 4 5 + +julia> transform(df, [:b, :a] => -) # vector subtraction is okay +3×3 DataFrame + Row │ a b b_a_- + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 3 + 2 │ 2 5 3 + 3 │ 3 4 1 + +julia> transform(df, [:a, :b] => *) # vector multiplication is not defined +ERROR: MethodError: no method matching *(::Vector{Int64}, ::Vector{Int64}) +``` + +Don't worry! There is a quick fix for the previous error. +If you want to apply a function to each element in a column +instead of to the entire column vector, +then you can wrap your element-wise function in `ByRow` like +`ByRow(my_elementwise_function)`. +This will apply `my_elementwise_function` to every element in the column +and then collect the results back into a vector. 
+ +```julia +julia> transform(df, [:a, :b] => ByRow(*)) +3×3 DataFrame + Row │ a b a_b_* + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 4 + 2 │ 2 5 10 + 3 │ 3 4 12 + +julia> transform(df, Cols(:) => ByRow(max)) +3×3 DataFrame + Row │ a b a_b_max + │ Int64 Int64 Int64 +─────┼─────────────────────── + 1 │ 1 4 4 + 2 │ 2 5 5 + 3 │ 3 4 4 + +julia> f(x) = x + 1 +f (generic function with 1 method) + +julia> transform(df, :a => ByRow(f)) +3×3 DataFrame + Row │ a b a_f + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 2 + 2 │ 2 5 3 + 3 │ 3 4 4 +``` + +Alternatively, you may just want to define the function itself so it +[broadcasts](https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting) +over vectors. + +```julia +julia> g(x) = x .+ 1 +g (generic function with 1 method) + +julia> transform(df, :a => g) +3×3 DataFrame + Row │ a b a_g + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 2 + 2 │ 2 5 3 + 3 │ 3 4 4 + +julia> h(x, y) = x .+ y .+ 1 +h (generic function with 1 method) + +julia> transform(df, [:a, :b] => h) +3×3 DataFrame + Row │ a b a_b_h + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 6 + 2 │ 2 5 8 + 3 │ 3 4 8 +``` + +[Anonymous functions](https://docs.julialang.org/en/v1/manual/functions/#man-anonymous-functions) +are a convenient way to define and use an `operation_function` +all within the manipulation function call. 
+ +```julia +julia> select(df, :a => ByRow(x -> x + 1)) +3×1 DataFrame + Row │ a_function + │ Int64 +─────┼──────────── + 1 │ 2 + 2 │ 3 + 3 │ 4 + +julia> transform(df, [:a, :b] => ByRow((x, y) -> 2x + y)) +3×3 DataFrame + Row │ a b a_b_function + │ Int64 Int64 Int64 +─────┼──────────────────────────── + 1 │ 1 4 6 + 2 │ 2 5 9 + 3 │ 3 4 10 + +julia> subset(df, :b => ByRow(x -> x < 5)) +2×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 3 4 + +julia> subset(df, :b => ByRow(<(5))) # shorter version of the previous +2×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 3 4 +``` + +!!! Note + `operation_functions` within `subset` or `subset!` function calls + must return a Boolean vector. + `true` elements in the Boolean vector will determine + which rows are retained in the resulting data frame. + +As demonstrated above, `DataFrame` columns are usually passed +from `source_column_selector` to `operation_function` as one or more +vector arguments. +However, when `AsTable(source_column_selector)` is used, +the selected columns are collected and passed as a single `NamedTuple` +to `operation_function`. + +This is often useful when your `operation_function` is defined to operate +on a single collection argument rather than on multiple positional arguments. +The distinction is somewhat similar to the difference between the built-in +`min` and `minimum` functions. +`min` is defined to find the minimum value among multiple positional arguments, +while `minimum` is defined to find the minimum value +among the elements of a single collection argument. 
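The `min`/`minimum` distinction can be seen without any data frame at all. The following is a small sketch in plain Julia; the named tuple `nt` is a made-up stand-in for the single row that `AsTable` combined with `ByRow` would hand to the function.

```julia
# `min` compares separate positional arguments.
smallest_of_args = min(3, 1, 2)

# `minimum` takes one collection and scans its elements.
smallest_in_vector = minimum([3, 1, 2])

# Hypothetical row, shaped like what `AsTable(...) => ByRow(f)` passes to `f`.
nt = (a = 3, b = 1, c = 2)

# A NamedTuple is a single collection, so `minimum` accepts it directly,
# while `min` needs the values splatted into positional arguments.
smallest_in_row = minimum(nt)
smallest_splatted = min(nt...)
```

All four results should be `1`; only the calling convention differs.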
+ +```julia +julia> df = DataFrame(a = 1:2, b = 3:4, c = 5:6, d = 2:-1:1) +2×4 DataFrame + Row │ a b c d + │ Int64 Int64 Int64 Int64 +─────┼──────────────────────────── + 1 │ 1 3 5 2 + 2 │ 2 4 6 1 + +julia> select(df, Cols(:) => ByRow(min)) # `min` operates on multiple arguments +2×1 DataFrame + Row │ a_b_etc_min + │ Int64 +─────┼───────────── + 1 │ 1 + 2 │ 1 + +julia> select(df, AsTable(:) => ByRow(minimum)) # `minimum` operates on a collection +2×1 DataFrame + Row │ a_b_etc_minimum + │ Int64 +─────┼───────────────── + 1 │ 1 + 2 │ 1 + +julia> select(df, [:a,:b] => ByRow(+)) # `+` operates on multiple arguments +2×1 DataFrame + Row │ a_b_+ + │ Int64 +─────┼─────── + 1 │ 4 + 2 │ 6 + +julia> select(df, AsTable([:a,:b]) => ByRow(sum)) # `sum` operates on a collection +2×1 DataFrame + Row │ a_b_sum + │ Int64 +─────┼───────── + 1 │ 4 + 2 │ 6 + +julia> using Statistics # contains the `mean` function + +julia> select(df, AsTable(Between(:b, :d)) => ByRow(mean)) # `mean` operates on a collection +2×1 DataFrame + Row │ b_c_d_mean │ Float64 -─────┼────────── - 1 │ 35.546 +─────┼──────────── + 1 │ 3.33333 + 2 │ 3.66667 +``` -julia> select(german, :Age => mean => :mean_age) -1000×1 DataFrame - Row │ mean_age - │ Float64 -──────┼────────── - 1 │ 35.546 - 2 │ 35.546 - 3 │ 35.546 - 4 │ 35.546 - 5 │ 35.546 - 6 │ 35.546 - 7 │ 35.546 - 8 │ 35.546 - ⋮ │ ⋮ - 994 │ 35.546 - 995 │ 35.546 - 996 │ 35.546 - 997 │ 35.546 - 998 │ 35.546 - 999 │ 35.546 - 1000 │ 35.546 - 985 rows omitted -``` - -As you can see in both cases the `mean` function was applied to `:Age` column -and the result was stored in the `:mean_age` column. The difference between -the `combine` and `select` functions is that the `combine` aggregates data -and produces as many rows as were returned by the transformation function. -On the other hand the `select` function always keeps the number of rows in a -data frame to be the same as in the source data frame. 
Therefore in this case -the result of the `mean` function got broadcasted. - -As `combine` potentially allows any number of rows to be produced as a result -of the transformation if we have a combination of transformations where some of -them produce a vector, and other produce scalars then scalars get broadcasted -exactly like in `select`. Here is an example: +`AsTable` can also be used to pass columns to a function which operates +on fields of a `NamedTuple`. -```jldoctest dataframe -julia> combine(german, :Age => mean => :mean_age, :Housing => unique => :housing) -3×2 DataFrame - Row │ mean_age housing - │ Float64 String7 -─────┼─────────────────── - 1 │ 35.546 own - 2 │ 35.546 free - 3 │ 35.546 rent +```julia +julia> df = DataFrame(a = 1:2, b = 3:4, c = 5:6, d = 7:8) +2×4 DataFrame + Row │ a b c d + │ Int64 Int64 Int64 Int64 +─────┼──────────────────────────── + 1 │ 1 3 5 7 + 2 │ 2 4 6 8 + +julia> f(nt) = nt.a + nt.d +f (generic function with 1 method) + +julia> transform(df, AsTable(:) => ByRow(f)) +2×5 DataFrame + Row │ a b c d a_b_etc_f + │ Int64 Int64 Int64 Int64 Int64 +─────┼─────────────────────────────────────── + 1 │ 1 3 5 7 8 + 2 │ 2 4 6 8 10 ``` -Note, however, that it is not allowed to return vectors of different lengths in -different transformations: +As demonstrated above, +in the `source_column_selector => operation_function` operation pair form, +the results of an operation will be placed into a new column with an +automatically-generated name based on the operation; +the new column name will be the `operation_function` name +appended to the source column name(s) with an underscore. -```jldoctest dataframe -julia> combine(german, :Age, :Housing => unique => :Housing) -ERROR: ArgumentError: New columns must have the same length as old columns +This automatic column naming behavior can be avoided in two ways. 
+First, the operation result can be placed back into the original column +with the original column name by switching the keyword argument `renamecols` +from its default value (`true`) to `renamecols=false`. +This option prevents the function name from being appended to the column name +as it usually would be. + +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> transform(df, :a => ByRow(x->x+10), renamecols=false) # add 10 in-place +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 11 5 + 2 │ 12 6 + 3 │ 13 7 + 4 │ 14 8 ``` -Let us discuss some other examples using `select`. Often we want to apply some -function not to the whole column of a data frame, but rather to its individual -elements. Normally we can achieve this using broadcasting like this: +The second method to avoid the default manipulation column naming is to +specify your own `new_column_names`. -```jldoctest dataframe -julia> select(german, :Sex => (x -> uppercase.(x)) => :Sex) -1000×1 DataFrame - Row │ Sex - │ String -──────┼──────── - 1 │ MALE - 2 │ FEMALE - 3 │ MALE - 4 │ MALE - 5 │ MALE - 6 │ MALE - 7 │ MALE - 8 │ MALE - ⋮ │ ⋮ - 994 │ MALE - 995 │ MALE - 996 │ FEMALE - 997 │ MALE - 998 │ MALE - 999 │ MALE - 1000 │ MALE -985 rows omitted +#### `new_column_names` + +`new_column_names` can be included at the end of an `operation` pair to specify +the name of the new column(s). +`new_column_names` may be a symbol, string, function, vector of symbols, vector of strings, or `AsTable`. 
+ +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> transform(df, Cols(:) => ByRow(+) => :c) +4×3 DataFrame + Row │ a b c + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 6 + 2 │ 2 6 8 + 3 │ 3 7 10 + 4 │ 4 8 12 + +julia> transform(df, Cols(:) => ByRow(+) => "a+b") +4×3 DataFrame + Row │ a b a+b + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 6 + 2 │ 2 6 8 + 3 │ 3 7 10 + 4 │ 4 8 12 + +julia> transform(df, :a => ByRow(x->x+10) => "a+10") +4×3 DataFrame + Row │ a b a+10 + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 11 + 2 │ 2 6 12 + 3 │ 3 7 13 + 4 │ 4 8 14 ``` -This pattern is encountered very often in practice, therefore there is a `ByRow` -convenience wrapper for a function that creates its broadcasted variant. In -these examples `ByRow` is a special type used for selection operations to signal -that the wrapped function should be applied to each element (row) of the -selection. Here we are passing `ByRow` wrapper to target column name `:Sex` -using `uppercase` function: +The `source_column_selector => new_column_names` operation form +can be used to rename columns without an intermediate function. +However, there are `rename` and `rename!` functions, +which accept similar syntax, +that tend to be more useful for this operation. 
-```jldoctest dataframe -julia> select(german, :Sex => ByRow(uppercase) => :SEX) -1000×1 DataFrame - Row │ SEX - │ String -──────┼──────── - 1 │ MALE - 2 │ FEMALE - 3 │ MALE - 4 │ MALE - 5 │ MALE - 6 │ MALE - 7 │ MALE - 8 │ MALE - ⋮ │ ⋮ - 994 │ MALE - 995 │ MALE - 996 │ FEMALE - 997 │ MALE - 998 │ MALE - 999 │ MALE - 1000 │ MALE -985 rows omitted +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> transform(df, :a => :apple) # adds column `apple` +4×3 DataFrame + Row │ a b apple + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 1 + 2 │ 2 6 2 + 3 │ 3 7 3 + 4 │ 4 8 4 + +julia> select(df, :a => :apple) # retains only column `apple` +4×1 DataFrame + Row │ apple + │ Int64 +─────┼─────── + 1 │ 1 + 2 │ 2 + 3 │ 3 + 4 │ 4 + +julia> rename(df, :a => :apple) # returns a copy with `a` renamed to `apple` in its original position +4×2 DataFrame + Row │ apple b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 ``` -In this case we transform our source column `:Age` using `ByRow` wrapper and -automatically generate the target column name: +If `new_column_names` already exist in the source data frame, +those columns will be replaced in the existing column location +rather than being added to the end. +This can be done by manually specifying an existing column name +or by using the `renamecols=false` keyword argument. 
-```jldoctest dataframe -julia> select(german, :Age, :Age => ByRow(sqrt)) -1000×2 DataFrame - Row │ Age Age_sqrt - │ Int64 Float64 -──────┼───────────────── - 1 │ 67 8.18535 - 2 │ 22 4.69042 - 3 │ 49 7.0 - 4 │ 45 6.7082 - 5 │ 53 7.28011 - 6 │ 35 5.91608 - 7 │ 53 7.28011 - 8 │ 35 5.91608 - ⋮ │ ⋮ ⋮ - 994 │ 30 5.47723 - 995 │ 50 7.07107 - 996 │ 31 5.56776 - 997 │ 40 6.32456 - 998 │ 38 6.16441 - 999 │ 23 4.79583 - 1000 │ 27 5.19615 - 985 rows omitted +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> transform(df, :b => (x -> x .+ 10)) # automatic new column and column name +4×3 DataFrame + Row │ a b b_function + │ Int64 Int64 Int64 +─────┼────────────────────────── + 1 │ 1 5 15 + 2 │ 2 6 16 + 3 │ 3 7 17 + 4 │ 4 8 18 + +julia> transform(df, :b => (x -> x .+ 10), renamecols=false) # transform column in-place +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 15 + 2 │ 2 16 + 3 │ 3 17 + 4 │ 4 18 + +julia> transform(df, :b => (x -> x .+ 10) => :a) # replace column :a +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 15 5 + 2 │ 16 6 + 3 │ 17 7 + 4 │ 18 8 ``` -When we pass just a column (without the `=>` part) we can use any column selector -that is allowed in indexing. +Actually, `renamecols=false` just prevents the function name from being appended to the final column name such that the operation is *usually* returned to the same column. 
-Here we exclude the column `:Age` from the resulting data frame: +```julia +julia> transform(df, [:a, :b] => +) # new column name is all source columns and function name +4×3 DataFrame + Row │ a b a_b_+ + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 6 + 2 │ 2 6 8 + 3 │ 3 7 10 + 4 │ 4 8 12 + +julia> transform(df, [:a, :b] => +, renamecols=false) # same as above but with no function name +4×3 DataFrame + Row │ a b a_b + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 6 + 2 │ 2 6 8 + 3 │ 3 7 10 + 4 │ 4 8 12 -```jldoctest dataframe -julia> select(german, Not(:Age)) -1000×9 DataFrame - Row │ id Sex Job Housing Saving accounts Checking account Cre ⋯ - │ Int64 String7 Int64 String7 String15 String15 Int ⋯ -──────┼───────────────────────────────────────────────────────────────────────── - 1 │ 0 male 2 own NA little ⋯ - 2 │ 1 female 2 own little moderate - 3 │ 2 male 1 own little NA - 4 │ 3 male 2 free little little - 5 │ 4 male 2 free little little ⋯ - 6 │ 5 male 1 free NA NA - 7 │ 6 male 2 own quite rich NA - 8 │ 7 male 3 rent little moderate - ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ - 994 │ 993 male 3 own little little ⋯ - 995 │ 994 male 2 own NA NA - 996 │ 995 female 1 own little NA - 997 │ 996 male 3 own little little - 998 │ 997 male 2 own little NA ⋯ - 999 │ 998 male 2 free little little - 1000 │ 999 male 2 own moderate moderate - 3 columns and 985 rows omitted +julia> transform(df, [:a, :b] => (+) => :a) # manually overwrite column :a (see Note below about parentheses) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 6 5 + 2 │ 8 6 + 3 │ 10 7 + 4 │ 12 8 ``` -In the next example we drop columns `"Age"`, `"Saving accounts"`, -`"Checking account"`, `"Credit amount"`, and `"Purpose"`. 
Note that this time -we use string column selectors because some of the column names have spaces -in them: +In the `source_column_selector => operation_function => new_column_names` operation form, +`new_column_names` may also be a renaming function which operates on a string +to create the destination column names programmatically. -```jldoctest dataframe -julia> select(german, Not(["Age", "Saving accounts", "Checking account", - "Credit amount", "Purpose"])) -1000×5 DataFrame - Row │ id Sex Job Housing Duration - │ Int64 String7 Int64 String7 Int64 -──────┼────────────────────────────────────────── - 1 │ 0 male 2 own 6 - 2 │ 1 female 2 own 48 - 3 │ 2 male 1 own 12 - 4 │ 3 male 2 free 42 - 5 │ 4 male 2 free 24 - 6 │ 5 male 1 free 36 - 7 │ 6 male 2 own 24 - 8 │ 7 male 3 rent 36 - ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ - 994 │ 993 male 3 own 36 - 995 │ 994 male 2 own 12 - 996 │ 995 female 1 own 12 - 997 │ 996 male 3 own 30 - 998 │ 997 male 2 own 12 - 999 │ 998 male 2 free 45 - 1000 │ 999 male 2 own 45 - 985 rows omitted - -``` - -As another example let us present that the `r"S"` regular expression we used -above also works with `select`: +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 -```jldoctest dataframe -julia> select(german, r"S") -1000×2 DataFrame - Row │ Sex Saving accounts - │ String7 String15 -──────┼────────────────────────── - 1 │ male NA - 2 │ female little - 3 │ male little - 4 │ male little - 5 │ male little - 6 │ male NA - 7 │ male quite rich - 8 │ male little - ⋮ │ ⋮ ⋮ - 994 │ male little - 995 │ male NA - 996 │ female little - 997 │ male little - 998 │ male little - 999 │ male little - 1000 │ male moderate - 985 rows omitted -``` - -The benefit of `select` or `combine` over indexing is that it is easier -to get the union of several column selectors, e.g.: +julia> add_prefix(s) = "new_" * s +add_prefix (generic function with 1 method) -```jldoctest dataframe -julia> 
select(german, r"S", "Job", 1) -1000×4 DataFrame - Row │ Sex Saving accounts Job id - │ String7 String15 Int64 Int64 -──────┼──────────────────────────────────────── - 1 │ male NA 2 0 - 2 │ female little 2 1 - 3 │ male little 1 2 - 4 │ male little 2 3 - 5 │ male little 2 4 - 6 │ male NA 1 5 - 7 │ male quite rich 2 6 - 8 │ male little 3 7 - ⋮ │ ⋮ ⋮ ⋮ ⋮ - 994 │ male little 3 993 - 995 │ male NA 2 994 - 996 │ female little 1 995 - 997 │ male little 3 996 - 998 │ male little 2 997 - 999 │ male little 2 998 - 1000 │ male moderate 2 999 - 985 rows omitted -``` - -Taking advantage of this flexibility here is an idiomatic pattern to move some -column to the front of a data frame: +julia> transform(df, :a => (x -> 10 .* x) => add_prefix) # with named renaming function +4×3 DataFrame + Row │ a b new_a + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 10 + 2 │ 2 6 20 + 3 │ 3 7 30 + 4 │ 4 8 40 + +julia> transform(df, :a => (x -> 10 .* x) => (s -> "new_" * s)) # with anonymous renaming function +4×3 DataFrame + Row │ a b new_a + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 10 + 2 │ 2 6 20 + 3 │ 3 7 30 + 4 │ 4 8 40 +``` + +!!! Note + It is a good idea to wrap anonymous functions in parentheses + to avoid the `=>` operator accidentally becoming part of the anonymous function. + The examples above do not work correctly without the parentheses! + ```julia + julia> transform(df, :a => x -> 10 .* x => add_prefix) # Not what we wanted! + 4×3 DataFrame + Row │ a b a_function + │ Int64 Int64 Pair… + ─────┼──────────────────────────────────────────── + 1 │ 1 5 [10, 20, 30, 40]=>add_prefix + 2 │ 2 6 [10, 20, 30, 40]=>add_prefix + 3 │ 3 7 [10, 20, 30, 40]=>add_prefix + 4 │ 4 8 [10, 20, 30, 40]=>add_prefix + + julia> transform(df, :a => x -> 10 .* x => s -> "new_" * s) # Not what we wanted! 
+ 4×3 DataFrame + Row │ a b a_function + │ Int64 Int64 Pair… + ─────┼───────────────────────────────────── + 1 │ 1 5 [10, 20, 30, 40]=>#18 + 2 │ 2 6 [10, 20, 30, 40]=>#18 + 3 │ 3 7 [10, 20, 30, 40]=>#18 + 4 │ 4 8 [10, 20, 30, 40]=>#18 + ``` + +A renaming function will not work in the +`source_column_selector => new_column_names` operation form +because a function in the second element of the operation pair is assumed to take +the `source_column_selector => operation_function` operation form. +To work around this limitation, use the +`source_column_selector => operation_function => new_column_names` operation form +with `identity` as the `operation_function`. -```jldoctest dataframe -julia> select(german, "Sex", :) -1000×10 DataFrame - Row │ Sex id Age Job Housing Saving accounts Checking accou ⋯ - │ String7 Int64 Int64 Int64 String7 String15 String15 ⋯ -──────┼───────────────────────────────────────────────────────────────────────── - 1 │ male 0 67 2 own NA little ⋯ - 2 │ female 1 22 2 own little moderate - 3 │ male 2 49 1 own little NA - 4 │ male 3 45 2 free little little - 5 │ male 4 53 2 free little little ⋯ - 6 │ male 5 35 1 free NA NA - 7 │ male 6 53 2 own quite rich NA - 8 │ male 7 35 3 rent little moderate - ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ - 994 │ male 993 30 3 own little little ⋯ - 995 │ male 994 50 2 own NA NA - 996 │ female 995 31 1 own little NA - 997 │ male 996 40 3 own little little - 998 │ male 997 38 2 own little NA ⋯ - 999 │ male 998 23 2 free little little - 1000 │ male 999 27 2 own moderate moderate - 4 columns and 985 rows omitted +```julia +julia> transform(df, :a => add_prefix) +ERROR: MethodError: no method matching *(::String, ::Vector{Int64}) + +julia> transform(df, :a => identity => add_prefix) +4×3 DataFrame + Row │ a b new_a + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 1 + 2 │ 2 6 2 + 3 │ 3 7 3 + 4 │ 4 8 4 ``` -Below, we are simply passing source column and target column name to rename them -(without specifying the transformation 
part): +In this case though, +it is probably again more useful to use the `rename` or `rename!` function +rather than one of the manipulation functions +in order to rename in-place and avoid the intermediate `operation_function`. +```julia +julia> rename(add_prefix, df) # rename all columns with a function +4×2 DataFrame + Row │ new_a new_b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> rename(add_prefix, df; cols=:a) # rename some columns with a function +4×2 DataFrame + Row │ new_a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 +``` -```jldoctest dataframe -julia> select(german, :Sex => :x1, :Age => :x2) -1000×2 DataFrame - Row │ x1 x2 - │ String7 Int64 -──────┼──────────────── - 1 │ male 67 - 2 │ female 22 - 3 │ male 49 - 4 │ male 45 - 5 │ male 53 - 6 │ male 35 - 7 │ male 53 - 8 │ male 35 - ⋮ │ ⋮ ⋮ - 994 │ male 30 - 995 │ male 50 - 996 │ female 31 - 997 │ male 40 - 998 │ male 38 - 999 │ male 23 - 1000 │ male 27 - 985 rows omitted +In the `source_column_selector => new_column_names` operation form, +only a single source column may be selected per operation, +so why is `new_column_names` plural? +It is possible to split the data contained inside a single column +into multiple new columns by supplying a vector of strings or symbols +as `new_column_names`. + +```julia +julia> df = DataFrame(data = [(1,2), (3,4)]) # vector of tuples +2×1 DataFrame + Row │ data + │ Tuple… +─────┼──────── + 1 │ (1, 2) + 2 │ (3, 4) + +julia> transform(df, :data => [:first, :second]) # manual naming +2×3 DataFrame + Row │ data first second + │ Tuple… Int64 Int64 +─────┼─────────────────────── + 1 │ (1, 2) 1 2 + 2 │ (3, 4) 3 4 ``` -It is important to note that `select` always returns a data frame, even if a -single column selected as opposed to indexing syntax. Compare the following: +This kind of data splitting can even be done automatically with `AsTable`. 
-```jldoctest dataframe -julia> select(german, :Age) -1000×1 DataFrame - Row │ Age - │ Int64 -──────┼─────── - 1 │ 67 - 2 │ 22 - 3 │ 49 - 4 │ 45 - 5 │ 53 - 6 │ 35 - 7 │ 53 - 8 │ 35 - ⋮ │ ⋮ - 994 │ 30 - 995 │ 50 - 996 │ 31 - 997 │ 40 - 998 │ 38 - 999 │ 23 - 1000 │ 27 -985 rows omitted +```julia +julia> transform(df, :data => AsTable) # default automatic naming with tuples +2×3 DataFrame + Row │ data x1 x2 + │ Tuple… Int64 Int64 +─────┼────────────────────── + 1 │ (1, 2) 1 2 + 2 │ (3, 4) 3 4 +``` -julia> german[:, :Age] -1000-element Vector{Int64}: - 67 - 22 - 49 - 45 - 53 - 35 - 53 - 35 - 61 - 28 - ⋮ - 34 - 23 - 30 - 50 - 31 - 40 - 38 - 23 - 27 -``` - -By default `select` copies columns of a passed source data frame. In order to -avoid copying, pass the `copycols=false` keyword argument: +If a data frame column contains `NamedTuple`s, +then `AsTable` will preserve the field names. +```julia +julia> df = DataFrame(data = [(a=1,b=2), (a=3,b=4)]) # vector of named tuples +2×1 DataFrame + Row │ data + │ NamedTup… +─────┼──────────────── + 1 │ (a = 1, b = 2) + 2 │ (a = 3, b = 4) -```jldoctest dataframe -julia> df = select(german, :Sex) -1000×1 DataFrame - Row │ Sex - │ String7 -──────┼───────── - 1 │ male - 2 │ female - 3 │ male - 4 │ male - 5 │ male - 6 │ male - 7 │ male - 8 │ male - ⋮ │ ⋮ - 994 │ male - 995 │ male - 996 │ female - 997 │ male - 998 │ male - 999 │ male - 1000 │ male -985 rows omitted +julia> transform(df, :data => AsTable) # keeps names from named tuples +2×3 DataFrame + Row │ data a b + │ NamedTup… Int64 Int64 +─────┼────────────────────────────── + 1 │ (a = 1, b = 2) 1 2 + 2 │ (a = 3, b = 4) 3 4 +``` -julia> df.Sex === german.Sex # copy -false +!!! Note + To pack multiple columns into a single column of `NamedTuple`s + (reverse of the above operation) + apply the `identity` function `ByRow`, e.g. + `transform(df, AsTable([:a, :b]) => ByRow(identity) => :data)`. 
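As a minimal sketch of the packing operation mentioned in the note above (the data frame here is illustrative and not taken from the surrounding examples), `AsTable` combined with `ByRow(identity)` should collect the selected columns into one column of `NamedTuple`s:

```julia
using DataFrames

# Illustrative data frame with two plain columns (names chosen for this sketch).
df = DataFrame(a = [1, 3], b = [2, 4])

# `AsTable([:a, :b])` hands each row to the function as a NamedTuple,
# and `ByRow(identity)` returns that NamedTuple unchanged,
# so the new `data` column holds one NamedTuple per row.
packed = transform(df, AsTable([:a, :b]) => ByRow(identity) => :data)

packed.data  # a vector of NamedTuples: (a = 1, b = 2) and (a = 3, b = 4)
```

Because `transform` is used, the original `a` and `b` columns are kept alongside `data`; using `select` instead would keep only the packed column.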
-julia> df = select(german, :Sex, copycols=false) -1000×1 DataFrame - Row │ Sex - │ String7 -──────┼───────── - 1 │ male - 2 │ female - 3 │ male - 4 │ male - 5 │ male - 6 │ male - 7 │ male - 8 │ male - ⋮ │ ⋮ - 994 │ male - 995 │ male - 996 │ female - 997 │ male - 998 │ male - 999 │ male - 1000 │ male -985 rows omitted +Renaming functions also work for multi-column transformations, +but they must operate on a vector of strings. + +```julia +julia> df = DataFrame(data = [(1,2), (3,4)]) +2×1 DataFrame + Row │ data + │ Tuple… +─────┼──────── + 1 │ (1, 2) + 2 │ (3, 4) + +julia> new_names(v) = ["primary ", "secondary "] .* v +new_names (generic function with 1 method) + +julia> transform(df, :data => identity => new_names) +2×3 DataFrame + Row │ data primary data secondary data + │ Tuple… Int64 Int64 +─────┼────────────────────────────────────── + 1 │ (1, 2) 1 2 + 2 │ (3, 4) 3 4 +``` + +### Applying Multiple Operations per Manipulation +All data frame manipulation functions can accept multiple `operation` pairs +at once using any of the following methods: +- `manipulation_function(dataframe, operation1, operation2)` : multiple arguments +- `manipulation_function(dataframe, [operation1, operation2])` : vector argument +- `manipulation_function(dataframe, [operation1 operation2])` : matrix argument + +Passing multiple operations is especially useful for the `select`, `select!`, +and `combine` manipulation functions, +since they only retain columns which are a result of the passed operations. 
+ +```julia +julia> df = DataFrame(a = 1:4, b = [50,50,60,60], c = ["hat","bat","cat","dog"]) +4×3 DataFrame + Row │ a b c + │ Int64 Int64 String +─────┼────────────────────── + 1 │ 1 50 hat + 2 │ 2 50 bat + 3 │ 3 60 cat + 4 │ 4 60 dog + +julia> combine(df, :a => maximum, :b => sum, :c => join) # 3 combine operations +1×3 DataFrame + Row │ a_maximum b_sum c_join + │ Int64 Int64 String +─────┼──────────────────────────────── + 1 │ 4 220 hatbatcatdog + +julia> select(df, :c, :b, :a) # re-order columns +4×3 DataFrame + Row │ c b a + │ String Int64 Int64 +─────┼────────────────────── + 1 │ hat 50 1 + 2 │ bat 50 2 + 3 │ cat 60 3 + 4 │ dog 60 4 + +julia> select(df, :b, :) # `:` here means all other columns +4×3 DataFrame + Row │ b a c + │ Int64 Int64 String +─────┼────────────────────── + 1 │ 50 1 hat + 2 │ 50 2 bat + 3 │ 60 3 cat + 4 │ 60 4 dog + +julia> select( + df, + :c => (x -> "a " .* x) => :one_c, + :a => (x -> 100x), + :b, + renamecols=false + ) # can mix operation forms +4×3 DataFrame + Row │ one_c a b + │ String Int64 Int64 +─────┼────────────────────── + 1 │ a hat 100 50 + 2 │ a bat 200 50 + 3 │ a cat 300 60 + 4 │ a dog 400 60 + +julia> select( + df, + :c => ByRow(reverse), + :c => ByRow(uppercase) + ) # multiple operations on same column +4×2 DataFrame + Row │ c_reverse c_uppercase + │ String String +─────┼──────────────────────── + 1 │ tah HAT + 2 │ tab BAT + 3 │ tac CAT + 4 │ god DOG +``` + +In the last two examples, +the manipulation function arguments were split across multiple lines. +This is a good way to make manipulations with many operations more readable. + +Passing multiple operations to `subset` or `subset!` is an easy way to narrow in +on a particular row of data.
+ +```julia +julia> subset( + df, + :b => ByRow(==(60)), + :c => ByRow(contains("at")) + ) # rows with 60 and "at" +1×3 DataFrame + Row │ a b c + │ Int64 Int64 String +─────┼────────────────────── + 1 │ 3 60 cat +``` -julia> df.Sex === german.Sex # no-copy is performed +Note that all operations within a single manipulation must use the data +as it existed before the function call +i.e. you cannot use newly created columns for subsequent operations +within the same manipulation. + +```julia +julia> transform( + df, + [:a, :b] => ByRow(+) => :d, + :d => (x -> x ./ 2), + ) # requires two separate transformations +ERROR: ArgumentError: column name :d not found in the data frame; existing most similar names are: :a, :b and :c + +julia> new_df = transform(df, [:a, :b] => ByRow(+) => :d) +4×4 DataFrame + Row │ a b c d + │ Int64 Int64 String Int64 +─────┼───────────────────────────── + 1 │ 1 50 hat 51 + 2 │ 2 50 bat 52 + 3 │ 3 60 cat 63 + 4 │ 4 60 dog 64 + +julia> transform!(new_df, :d => (x -> x ./ 2) => :d_2) +4×5 DataFrame + Row │ a b c d d_2 + │ Int64 Int64 String Int64 Float64 +─────┼────────────────────────────────────── + 1 │ 1 50 hat 51 25.5 + 2 │ 2 50 bat 52 26.0 + 3 │ 3 60 cat 63 31.5 + 4 │ 4 60 dog 64 32.0 +``` + + +### Broadcasting Operation Pairs + +[Broadcasting](https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting) +pairs with `.=>` is often a convenient way to generate multiple +similar `operation`s to be applied within a single manipulation. +Broadcasting within the `Pair` of an `operation` is no different than +broadcasting in base Julia. +The broadcasting `.=>` will be expanded into a vector of pairs +(`[operation1, operation2, ...]`), +and this expansion will occur before the manipulation function is invoked. +Then the manipulation function will use the +`manipulation_function(dataframe, [operation1, operation2, ...])` method. +This process will be explained in more detail below. 
+ +To illustrate these concepts, let us first examine the `Type` of a basic `Pair`. +In DataFrames.jl, a symbol, string, or integer +may be used to select a single column. +Some `Pair`s with these types are below. + +```julia +julia> typeof(:x => :a) +Pair{Symbol, Symbol} + +julia> typeof("x" => "a") +Pair{String, String} + +julia> typeof(1 => "a") +Pair{Int64, String} +``` + +Any of the `Pair`s above could be used to rename the first column +of the data frame below to `a`. + +```julia +julia> df = DataFrame(x = 1:3, y = 4:6) +3×2 DataFrame + Row │ x y + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 + +julia> select(df, :x => :a) +3×1 DataFrame + Row │ a + │ Int64 +─────┼─────── + 1 │ 1 + 2 │ 2 + 3 │ 3 + +julia> select(df, 1 => "a") +3×1 DataFrame + Row │ a + │ Int64 +─────┼─────── + 1 │ 1 + 2 │ 2 + 3 │ 3 +``` + +What should we do if we want to keep and rename both the `x` and `y` column? +One option is to supply a `Vector` of operation `Pair`s to `select`. +`select` will process all of these operations in order. + +```julia +julia> ["x" => "a", "y" => "b"] +2-element Vector{Pair{String, String}}: + "x" => "a" + "y" => "b" + +julia> select(df, ["x" => "a", "y" => "b"]) +3×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 +``` + +We can use broadcasting to simplify the syntax above. + +```julia +julia> ["x", "y"] .=> ["a", "b"] +2-element Vector{Pair{String, String}}: + "x" => "a" + "y" => "b" + +julia> select(df, ["x", "y"] .=> ["a", "b"]) +3×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 +``` + +Notice that `select` sees the same `Vector{Pair{String, String}}` operation +argument whether the individual pairs are written out explicitly or +constructed with broadcasting. +The broadcasting is applied before the call to `select`. 
+ +```julia +julia> ["x" => "a", "y" => "b"] == (["x", "y"] .=> ["a", "b"]) true ``` -To perform the selection operation in-place use `select!`: +!!! Note + These operation pairs (or vector of pairs) can be given variable names. + This is uncommon in practice but could be helpful for intermediate + inspection and testing. + ```julia + df = DataFrame(x = 1:3, y = 4:6) # create data frame + operation = ["x", "y"] .=> ["a", "b"] # save operation to variable + typeof(operation) # check type of operation + first(operation) # check first pair in operation + last(operation) # check last pair in operation + select(df, operation) # manipulate `df` with `operation` + ``` -```jldoctest dataframe -julia> select!(german, Not(:Age)); +In Julia, +a non-vector broadcasted with a vector will be repeated in each resultant pair element. -julia> german -1000×9 DataFrame - Row │ id Sex Job Housing Saving accounts Checking account Cre ⋯ - │ Int64 String7 Int64 String7 String15 String15 Int ⋯ -──────┼───────────────────────────────────────────────────────────────────────── - 1 │ 0 male 2 own NA little ⋯ - 2 │ 1 female 2 own little moderate - 3 │ 2 male 1 own little NA - 4 │ 3 male 2 free little little - 5 │ 4 male 2 free little little ⋯ - 6 │ 5 male 1 free NA NA - 7 │ 6 male 2 own quite rich NA - 8 │ 7 male 3 rent little moderate - ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ - 994 │ 993 male 3 own little little ⋯ - 995 │ 994 male 2 own NA NA - 996 │ 995 female 1 own little NA - 997 │ 996 male 3 own little little - 998 │ 997 male 2 own little NA ⋯ - 999 │ 998 male 2 free little little - 1000 │ 999 male 2 own moderate moderate - 3 columns and 985 rows omitted +```julia +julia> ["x", "y"] .=> :a # :a is repeated +2-element Vector{Pair{String, Symbol}}: + "x" => :a + "y" => :a + +julia> 1 .=> [:a, :b] # 1 is repeated +2-element Vector{Pair{Int64, Symbol}}: + 1 => :a + 1 => :b ``` -As you can see the `:Age` column was dropped from the `german` data frame. 
+We can use this fact to easily broadcast an `operation_function` to multiple columns. -The `transform` and `transform!` functions work identically to `select` and -`select!` with the only difference that they retain all columns that are present -in the source data frame. Here are some examples: +```julia +julia> f(x) = 2 * x +f (generic function with 1 method) -```jldoctest dataframe -julia> german = copy(german_ref); +julia> ["x", "y"] .=> f # f is repeated +2-element Vector{Pair{String, typeof(f)}}: + "x" => f + "y" => f -julia> df = german_ref[1:8, 1:5] -8×5 DataFrame - Row │ id Age Sex Job Housing - │ Int64 Int64 String7 Int64 String7 -─────┼─────────────────────────────────────── - 1 │ 0 67 male 2 own - 2 │ 1 22 female 2 own - 3 │ 2 49 male 1 own - 4 │ 3 45 male 2 free - 5 │ 4 53 male 2 free - 6 │ 5 35 male 1 free - 7 │ 6 53 male 2 own - 8 │ 7 35 male 3 rent - -julia> transform(df, :Age => maximum) -8×6 DataFrame - Row │ id Age Sex Job Housing Age_maximum - │ Int64 Int64 String7 Int64 String7 Int64 +julia> select(df, ["x", "y"] .=> f) # apply f with automatic column renaming +3×2 DataFrame + Row │ x_f y_f + │ Int64 Int64 +─────┼────────────── + 1 │ 2 8 + 2 │ 4 10 + 3 │ 6 12 + +julia> ["x", "y"] .=> f .=> ["a", "b"] # f is repeated +2-element Vector{Pair{String, Pair{typeof(f), String}}}: + "x" => (f => "a") + "y" => (f => "b") + +julia> select(df, ["x", "y"] .=> f .=> ["a", "b"]) # apply f with manual column renaming +3×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 2 8 + 2 │ 4 10 + 3 │ 6 12 +``` + +A renaming function can be applied to multiple columns in the same way. +It will also be repeated in each operation `Pair`. 
+ +```julia +julia> newname(s::String) = s * "_new" +newname (generic function with 1 method) + +julia> ["x", "y"] .=> f .=> newname # both f and newname are repeated +2-element Vector{Pair{String, Pair{typeof(f), typeof(newname)}}}: + "x" => (f => newname) + "y" => (f => newname) + +julia> select(df, ["x", "y"] .=> f .=> newname) # apply f then rename column with newname +3×2 DataFrame + Row │ x_new y_new + │ Int64 Int64 +─────┼────────────── + 1 │ 2 8 + 2 │ 4 10 + 3 │ 6 12 +``` + +You can see from the type output above +that a three element pair does not actually exist. +A `Pair` (as the name implies) can only contain two elements. +Thus, `:x => :y => :z` becomes a nested `Pair`, +where `:x` is the first element and points to the `Pair` `:y => :z`, +which is the second element. + +```julia +julia> p = :x => :y => :z +:x => (:y => :z) + +julia> p[1] +:x + +julia> p[2] +:y => :z + +julia> p[2][1] +:y + +julia> p[2][2] +:z + +julia> p[3] # there is no index 3 for a pair +ERROR: BoundsError: attempt to access Pair{Symbol, Pair{Symbol, Symbol}} at index [3] +``` + +In the previous examples, the source columns have been individually selected. +When broadcasting multiple columns to the same function, +often similarities in the column names or position can be exploited to avoid +tedious selection. +Consider a data frame with temperature data at three different locations +taken over time. +```julia +julia> df = DataFrame(Time = 1:4, + Temperature1 = [20, 23, 25, 28], + Temperature2 = [33, 37, 41, 44], + Temperature3 = [15, 10, 4, 0]) +4×4 DataFrame + Row │ Time Temperature1 Temperature2 Temperature3 + │ Int64 Int64 Int64 Int64 +─────┼───────────────────────────────────────────────── + 1 │ 1 20 33 15 + 2 │ 2 23 37 10 + 3 │ 3 25 41 4 + 4 │ 4 28 44 0 +``` + +To convert all of the temperature data in one transformation, +we just need to define a conversion function and broadcast +it to all of the "Temperature" columns. 
+ +```julia +julia> celsius_to_kelvin(x) = x + 273 +celsius_to_kelvin (generic function with 1 method) + +julia> transform( + df, + Cols(r"Temp") .=> ByRow(celsius_to_kelvin), + renamecols = false + ) +4×4 DataFrame + Row │ Time Temperature1 Temperature2 Temperature3 + │ Int64 Int64 Int64 Int64 +─────┼───────────────────────────────────────────────── + 1 │ 1 293 306 288 + 2 │ 2 296 310 283 + 3 │ 3 298 314 277 + 4 │ 4 301 317 273 +``` +Or, simultaneously changing the column names: + +```julia +julia> rename_function(s) = "Temperature $(last(s)) (K)" +rename_function (generic function with 1 method) + +julia> select( + df, + "Time", + Cols(r"Temp") .=> ByRow(celsius_to_kelvin) .=> rename_function + ) +4×4 DataFrame + Row │ Time Temperature 1 (K) Temperature 2 (K) Temperature 3 (K) + │ Int64 Int64 Int64 Int64 +─────┼──────────────────────────────────────────────────────────────── + 1 │ 1 293 306 288 + 2 │ 2 296 310 283 + 3 │ 3 298 314 277 + 4 │ 4 301 317 273 +``` + +!!! Note Notes + * `Not("Time")` or `2:4` would have been equally good choices for `source_column_selector` in the above operations. + * Don't forget `ByRow` if your function is to be applied to elements rather than entire column vectors. + Without `ByRow`, the manipulations above would have thrown + `ERROR: MethodError: no method matching +(::Vector{Int64}, ::Int64)`. + * Regular expression (`r""`) and `:` `source_column_selectors` + must be wrapped in `Cols` to be properly broadcasted + because otherwise the broadcasting occurs before the expression is expanded into a vector of matches. + +You could also broadcast different columns to different functions +by supplying a vector of functions. 
+ +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> f1(x) = x .+ 1 +f1 (generic function with 1 method) + +julia> f2(x) = x ./ 10 +f2 (generic function with 1 method) + +julia> transform(df, [:a, :b] .=> [f1, f2]) +4×4 DataFrame + Row │ a b a_f1 b_f2 + │ Int64 Int64 Int64 Float64 +─────┼────────────────────────────── + 1 │ 1 5 2 0.5 + 2 │ 2 6 3 0.6 + 3 │ 3 7 4 0.7 + 4 │ 4 8 5 0.8 +``` + +However, this form is not much more convenient than supplying +multiple individual operations. + +```julia +julia> transform(df, [:a => f1, :b => f2]) # same manipulation as previous +4×4 DataFrame + Row │ a b a_f1 b_f2 + │ Int64 Int64 Int64 Float64 +─────┼────────────────────────────── + 1 │ 1 5 2 0.5 + 2 │ 2 6 3 0.6 + 3 │ 3 7 4 0.7 + 4 │ 4 8 5 0.8 +``` + +Perhaps more useful for broadcasting syntax +is to apply multiple functions to multiple columns +by changing the vector of functions to a 1-by-x matrix of functions. +(Recall that a list, a vector, or a matrix of operation pairs are all valid +for passing to the manipulation functions.) + +```julia +julia> [:a, :b] .=> [f1 f2] # No comma `,` between f1 and f2 +2×2 Matrix{Pair{Symbol}}: + :a=>f1 :a=>f2 + :b=>f1 :b=>f2 + +julia> transform(df, [:a, :b] .=> [f1 f2]) # No comma `,` between f1 and f2 +4×6 DataFrame + Row │ a b a_f1 b_f1 a_f2 b_f2 + │ Int64 Int64 Int64 Int64 Float64 Float64 +─────┼────────────────────────────────────────────── + 1 │ 1 5 2 6 0.1 0.5 + 2 │ 2 6 3 7 0.2 0.6 + 3 │ 3 7 4 8 0.3 0.7 + 4 │ 4 8 5 9 0.4 0.8 +``` + +In this way, every combination of selected columns and functions will be applied. + +Pair broadcasting is a simple but powerful tool +that can be used in any of the manipulation functions listed under +[Manipulation Functions](@ref). +Experiment for yourself to discover other useful operations.
+ +### Additional Resources +More details and examples of operation pair syntax can be found in +[this blog post](https://bkamins.github.io/julialang/2020/12/24/minilanguage.html). +(The official wording describing the syntax has changed since the blog post was written, +but the examples are still illustrative. +The operation pair syntax is sometimes referred to as the DataFrames.jl mini-language +or Domain-Specific Language.) + +For additional syntax niceties, +many users find the [Chain.jl](https://github.com/jkrumbiegel/Chain.jl) +and [DataFramesMeta.jl](https://github.com/JuliaData/DataFramesMeta.jl) +packages useful +to help simplify manipulations that may be tedious with operation pairs alone. + +## Approach Comparison + +After that deep dive into [Manipulation Functions](@ref), +it is a good idea to review the alternative approaches covered in +[Getting and Setting Data in a Data Frame](@ref). +Let us compare the two approaches with a few examples. + +### Convenience + +For simple operations, +often getting/setting data with dot syntax +is simpler than the equivalent data frame manipulation. +Here we will add the two columns of our data frame together +and place the result in a new third column. 
+ +Setup: + +```julia +julia> df = DataFrame(x = 1:3, y = 4:6) # define data frame +3×2 DataFrame + Row │ x y + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 +``` + +Manipulation: + +```julia +julia> transform!(df, [:x, :y] => (+) => :z) +3×3 DataFrame + Row │ x y z + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 5 + 2 │ 2 5 7 + 3 │ 3 6 9 +``` + +Dot Syntax: + +```julia +julia> df.x # dot syntax returns a vector +3-element Vector{Int64}: + 1 + 2 + 3 + +julia> df.z = df.x + df.y +3-element Vector{Int64}: + 5 + 7 + 9 + +julia> df # see that the previous expression updated the data frame `df` +3×3 DataFrame + Row │ x y z + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 5 + 2 │ 2 5 7 + 3 │ 3 6 9 +``` + +Recall that the return type from a data frame manipulation function call is always a `DataFrame`. +The return type of a data frame column accessed with dot syntax is a `Vector`. +Thus the expression `df.x + df.y` gets the column data as vectors +and returns the result of the vector addition. +However, in that same line, +we assigned the resultant `Vector` to a new column `z` in the data frame `df`. +We could have instead assigned the resultant `Vector` to some other variable, +and then `df` would not have been altered. +The approach with dot syntax is very versatile +since the data getting, mathematics, and data setting can be separate steps. + +```julia +julia> df.x +3-element Vector{Int64}: + 1 + 2 + 3 + +julia> v = df.x + df.y +3-element Vector{Int64}: + 5 + 7 + 9 + +julia> df.z = v +3-element Vector{Int64}: + 5 + 7 + 9 +``` + +One downside to dot syntax is that the column name must be explicitly written in the code. +Indexing syntax can perform a similar operation with dynamic column names. +(Manipulation functions can also work with dynamic column names as will be shown in the next example.) 
+ +```julia +julia> df = DataFrame("My First Column" => 1:3, "My Second Column" => 4:6) # define data frame +3×2 DataFrame + Row │ My First Column My Second Column + │ Int64 Int64 +─────┼─────────────────────────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 + +julia> c1 = "My First Column"; c2 = "My Second Column"; c3 = "My Third Column"; # define column names + +# Imagine the above data was read from a file or entered by a user at runtime. + +julia> df.c1 # dot syntax expects an explicit column name and cannot be used +ERROR: ArgumentError: column name :c1 not found in the data frame + +julia> df[:, c3] = df[:, c1] + df[:, c2] # access columns with names stored in variables +3-element Vector{Int64}: + 5 + 7 + 9 + +julia> df # see that the previous expression updated the data frame `df` +3×3 DataFrame + Row │ My First Column My Second Column My Third Column + │ Int64 Int64 Int64 ─────┼──────────────────────────────────────────────────── - 1 │ 0 67 male 2 own 67 - 2 │ 1 22 female 2 own 67 - 3 │ 2 49 male 1 own 67 - 4 │ 3 45 male 2 free 67 - 5 │ 4 53 male 2 free 67 - 6 │ 5 35 male 1 free 67 - 7 │ 6 53 male 2 own 67 - 8 │ 7 35 male 3 rent 67 + 1 │ 1 4 5 + 2 │ 2 5 7 + 3 │ 3 6 9 ``` -In the example below we are swapping values stored in columns `:Sex` and `:Age`: +One benefit of using manipulation functions is that +the name of the data frame only needs to be written once. 
-```jldoctest dataframe -julia> transform(german, :Age => :Sex, :Sex => :Age) -1000×10 DataFrame - Row │ id Age Sex Job Housing Saving accounts Checking accou ⋯ - │ Int64 String7 Int64 Int64 String7 String15 String15 ⋯ -──────┼───────────────────────────────────────────────────────────────────────── - 1 │ 0 male 67 2 own NA little ⋯ - 2 │ 1 female 22 2 own little moderate - 3 │ 2 male 49 1 own little NA - 4 │ 3 male 45 2 free little little - 5 │ 4 male 53 2 free little little ⋯ - 6 │ 5 male 35 1 free NA NA - 7 │ 6 male 53 2 own quite rich NA - 8 │ 7 male 35 3 rent little moderate - ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ - 994 │ 993 male 30 3 own little little ⋯ - 995 │ 994 male 50 2 own NA NA - 996 │ 995 female 31 1 own little NA - 997 │ 996 male 40 3 own little little - 998 │ 997 male 38 2 own little NA ⋯ - 999 │ 998 male 23 2 free little little - 1000 │ 999 male 27 2 own moderate moderate - 4 columns and 985 rows omitted +Setup: + +```julia +julia> my_very_long_data_frame_name = DataFrame( + "My First Column" => 1:3, + "My Second Column" => 4:6 + ) # define data frame +3×2 DataFrame + Row │ My First Column My Second Column + │ Int64 Int64 +─────┼─────────────────────────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 + +julia> c1 = "My First Column"; c2 = "My Second Column"; c3 = "My Third Column"; # define column names ``` -If we give more than one source column to a transformation they are passed as -consecutive positional arguments. 
So for example the -`[:Age, :Job] => (+) => :res` transformation below evaluates `+(df1.Age, df1.Job)` -(which adds two columns) and stores the result in the `:res` column: +Manipulation: -```jldoctest dataframe -julia> select(german, :Age, :Job, [:Age, :Job] => (+) => :res) -1000×3 DataFrame - Row │ Age Job res - │ Int64 Int64 Int64 -──────┼───────────────────── - 1 │ 67 2 69 - 2 │ 22 2 24 - 3 │ 49 1 50 - 4 │ 45 2 47 - 5 │ 53 2 55 - 6 │ 35 1 36 - 7 │ 53 2 55 - 8 │ 35 3 38 - ⋮ │ ⋮ ⋮ ⋮ - 994 │ 30 3 33 - 995 │ 50 2 52 - 996 │ 31 1 32 - 997 │ 40 3 43 - 998 │ 38 2 40 - 999 │ 23 2 25 - 1000 │ 27 2 29 - 985 rows omitted -``` - -This concludes the introductory examples of data frame manipulations. -See later sections of the manual, -particularly [A Gentle Introduction to Data Frame Manipulation Functions](@ref), -for additional explanations and functionality, -including how to broadcast operation functions and operation pairs -and how to pass or produce multiple columns using `AsTable`. +```julia +julia> transform!(my_very_long_data_frame_name, [c1, c2] => (+) => c3) +3×3 DataFrame + Row │ My First Column My Second Column My Third Column + │ Int64 Int64 Int64 +─────┼──────────────────────────────────────────────────── + 1 │ 1 4 5 + 2 │ 2 5 7 + 3 │ 3 6 9 +``` + +Indexing: + +```julia +julia> my_very_long_data_frame_name[:, c3] = my_very_long_data_frame_name[:, c1] + my_very_long_data_frame_name[:, c2] +3-element Vector{Int64}: + 5 + 7 + 9 + +julia> my_very_long_data_frame_name # see that the previous expression updated the data frame +3×3 DataFrame + Row │ My First Column My Second Column My Third Column + │ Int64 Int64 Int64 +─────┼──────────────────────────────────────────────────── + 1 │ 1 4 5 + 2 │ 2 5 7 + 3 │ 3 6 9 +``` + +### Speed + +TODO: Compare speed, memory, and view options (@view, !, :, copycols=false). +(May need someone else to write this part unless I do more studying.)
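As a possible starting point for that comparison, here is a minimal sketch of which access patterns copy column data and which reuse it. It does not measure speed or memory; it only checks aliasing with `===`, and the data frame is illustrative rather than one used elsewhere in this section.

```julia
using DataFrames

# Illustrative data frame (not from the examples above).
df = DataFrame(x = 1:3, y = 4:6)

# `!` returns the stored column itself; `:` returns a freshly allocated copy.
no_copy = df[!, :x]
copied = df[:, :x]
no_copy === df.x  # true: same vector object as the stored column
copied === df.x   # false: a new vector holding the same values

# `select` copies columns by default; `copycols=false` reuses them instead.
fresh = select(df, :x)
aliased = select(df, :x, copycols=false)
fresh.x === df.x    # false
aliased.x === df.x  # true
```

Avoiding copies saves allocations, but any later in-place change to an aliased column is then visible through both data frames, so the non-copying forms trade safety for memory.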
diff --git a/docs/src/man/manipulation_functions.md b/docs/src/man/manipulation_functions.md deleted file mode 100644 index 72df944763..0000000000 --- a/docs/src/man/manipulation_functions.md +++ /dev/null @@ -1,1431 +0,0 @@ -# A Gentle Introduction to Data Frame Manipulation Functions - -The seven functions below can be used to manipulate data frames -by applying operations to them. -This section of the documentation aims to methodically build understanding -of these functions and their possible arguments -by reinforcing foundational concepts and slowly increasing complexity. - -The functions without a `!` in their name -will create a new data frame based on the source data frame, -so you will probably want to store the new data frame to a new variable name, -e.g. `new_df = transform(source_df, operation)`. -The functions with a `!` at the end of their name -will modify an existing data frame in-place, -so there is typically no need to assign the result to a variable, -e.g. `transform!(source_df, operation)` instead of -`source_df = transform(source_df, operation)`. - -The number of columns and rows in the resultant data frame varies -depending on the manipulation function employed. - -| Function | Memory Usage | Column Retention | Row Retention | -| ------------ | -------------------------------- | --------------------------------------- | --------------------------------------------------- | -| `transform` | Creates a new data frame. | Retains original and resultant columns. | Retains same number of rows as original data frame. | -| `transform!` | Modifies an existing data frame. | Retains original and resultant columns. | Retains same number of rows as original data frame. | -| `select` | Creates a new data frame. | Retains only resultant columns. | Retains same number of rows as original data frame. | -| `select!` | Modifies an existing data frame. | Retains only resultant columns. | Retains same number of rows as original data frame. 
| -| `subset` | Creates a new data frame. | Retains original columns. | Retains only rows where condition is true. | -| `subset!` | Modifies an existing data frame. | Retains original columns. | Retains only rows where condition is true. | -| `combine` | Creates a new data frame. | Retains only resultant columns. | Retains only resultant rows. | - -## Constructing Operations - -All of the functions above use the same syntax which is commonly -`manipulation_function(dataframe, operation)`. -The `operation` argument defines the -operation to be applied to the source `dataframe`, -and it can take any of the following common forms explained below: - -`source_column_selector` -: selects source column(s) without manipulating or renaming them - - Examples: `:a`, `[:a, :b]`, `All()`, `Not(:a)` - -`source_column_selector => operation_function` -: passes source column(s) as arguments to a function -and automatically names the resulting column(s) - - Examples: `:a => sum`, `[:a, :b] => +`, `:a => ByRow(==(3))` - -`source_column_selector => operation_function => new_column_names` -: passes source column(s) as arguments to a function -and names the resulting column(s) `new_column_names` - - Examples: `:a => sum => :sum_of_a`, `[:a, :b] => + => :a_plus_b` - - *(Not available for `subset` or `subset!`)* - -`source_column_selector => new_column_names` -: renames a source column, -or splits a column containing collection elements into multiple new columns - - Examples: `:a => :new_a`, `:a_b => [:a, :b]`, `:nt => AsTable` - - (*Not available for `subset` or `subset!`*) - -The `=>` operator constructs a -[Pair](https://docs.julialang.org/en/v1/base/collections/#Core.Pair), -which is a type to link one object to another. -(Pairs are commonly used to create elements of a -[Dictionary](https://docs.julialang.org/en/v1/base/collections/#Dictionaries).) -In DataFrames.jl manipulation functions, -`Pair` arguments are used to define column `operations` to be performed. 
-The examples shown above will be explained in more detail later. - -*The manipulation functions also have methods for applying multiple operations. -See the later sections [Applying Multiple Operations per Manipulation](@ref) -and [Broadcasting Operation Pairs](@ref) for more information.* - -### `source_column_selector` -Inside an `operation`, `source_column_selector` is usually a column name -or column index which identifies a data frame column. - -`source_column_selector` may be used as the entire `operation` -with `select` or `select!` to isolate or reorder columns. - -```julia -julia> df = DataFrame(a = [1, 2, 3], b = [4, 5, 6], c = [7, 8, 9]) -3×3 DataFrame - Row │ a b c - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 4 7 - 2 │ 2 5 8 - 3 │ 3 6 9 - -julia> select(df, :b) -3×1 DataFrame - Row │ b - │ Int64 -─────┼─────── - 1 │ 4 - 2 │ 5 - 3 │ 6 - -julia> select(df, "b") -3×1 DataFrame - Row │ b - │ Int64 -─────┼─────── - 1 │ 4 - 2 │ 5 - 3 │ 6 - -julia> select(df, 2) -3×1 DataFrame - Row │ b - │ Int64 -─────┼─────── - 1 │ 4 - 2 │ 5 - 3 │ 6 -``` - -`source_column_selector` may also be used as the entire `operation` -with `subset` or `subset!` if the source column contains `Bool` values. - -```julia -julia> df = DataFrame( - name = ["Scott", "Jill", "Erica", "Jimmy"], - minor = [false, true, false, true], - ) -4×2 DataFrame - Row │ name minor - │ String Bool -─────┼─────────────── - 1 │ Scott false - 2 │ Jill true - 3 │ Erica false - 4 │ Jimmy true - -julia> subset(df, :minor) -2×2 DataFrame - Row │ name minor - │ String Bool -─────┼─────────────── - 1 │ Jill true - 2 │ Jimmy true -``` - -`source_column_selector` may instead be a collection of columns such as a vector, -a [regular expression](https://docs.julialang.org/en/v1/manual/strings/#Regular-Expressions), -a `Not`, `Between`, `All`, or `Cols` expression, -or a `:`. -See the [Indexing](@ref) API for the full list of possible values with references. - -!!! 
Note - The Julia parser sometimes prevents `:` from being used by itself. - If you get - `ERROR: syntax: whitespace not allowed after ":" used for quoting`, - try using `All()`, `Cols(:)`, or `(:)` instead to select all columns. - -```julia -julia> df = DataFrame( - id = [1, 2, 3], - first_name = ["José", "Emma", "Nathan"], - last_name = ["Garcia", "Marino", "Boyer"], - age = [61, 24, 33] - ) -3×4 DataFrame - Row │ id first_name last_name age - │ Int64 String String Int64 -─────┼───────────────────────────────────── - 1 │ 1 José Garcia 61 - 2 │ 2 Emma Marino 24 - 3 │ 3 Nathan Boyer 33 - -julia> select(df, [:last_name, :first_name]) -3×2 DataFrame - Row │ last_name first_name - │ String String -─────┼─────────────────────── - 1 │ Garcia José - 2 │ Marino Emma - 3 │ Boyer Nathan - -julia> select(df, r"name") -3×2 DataFrame - Row │ first_name last_name - │ String String -─────┼─────────────────────── - 1 │ José Garcia - 2 │ Emma Marino - 3 │ Nathan Boyer - -julia> select(df, Not(:id)) -3×3 DataFrame - Row │ first_name last_name age - │ String String Int64 -─────┼────────────────────────────── - 1 │ José Garcia 61 - 2 │ Emma Marino 24 - 3 │ Nathan Boyer 33 - -julia> select(df, Between(2,4)) -3×3 DataFrame - Row │ first_name last_name age - │ String String Int64 -─────┼────────────────────────────── - 1 │ José Garcia 61 - 2 │ Emma Marino 24 - 3 │ Nathan Boyer 33 - -julia> df2 = DataFrame( - name = ["Scott", "Jill", "Erica", "Jimmy"], - minor = [false, true, false, true], - male = [true, false, false, true], - ) -4×3 DataFrame - Row │ name minor male - │ String Bool Bool -─────┼────────────────────── - 1 │ Scott false true - 2 │ Jill true false - 3 │ Erica false false - 4 │ Jimmy true true - -julia> subset(df2, [:minor, :male]) -1×3 DataFrame - Row │ name minor male - │ String Bool Bool -─────┼───────────────────── - 1 │ Jimmy true true -``` - -### `operation_function` -Inside an `operation` pair, `operation_function` is a function -which operates on data frame columns 
passed as vectors. -When multiple columns are selected by `source_column_selector`, -the `operation_function` will receive the columns as separate positional arguments -in the order they were selected, e.g. `f(column1, column2, column3)`. - -```julia -julia> df = DataFrame(a = [1, 2, 3], b = [4, 5, 4]) -3×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 4 - 2 │ 2 5 - 3 │ 3 4 - -julia> combine(df, :a => sum) -1×1 DataFrame - Row │ a_sum - │ Int64 -─────┼─────── - 1 │ 6 - -julia> transform(df, :b => maximum) # `transform` and `select` copy scalar result to all rows -3×3 DataFrame - Row │ a b b_maximum - │ Int64 Int64 Int64 -─────┼───────────────────────── - 1 │ 1 4 5 - 2 │ 2 5 5 - 3 │ 3 4 5 - -julia> transform(df, [:b, :a] => -) # vector subtraction is okay -3×3 DataFrame - Row │ a b b_a_- - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 4 3 - 2 │ 2 5 3 - 3 │ 3 4 1 - -julia> transform(df, [:a, :b] => *) # vector multiplication is not defined -ERROR: MethodError: no method matching *(::Vector{Int64}, ::Vector{Int64}) -``` - -Don't worry! There is a quick fix for the previous error. -If you want to apply a function to each element in a column -instead of to the entire column vector, -then you can wrap your element-wise function in `ByRow` like -`ByRow(my_elementwise_function)`. -This will apply `my_elementwise_function` to every element in the column -and then collect the results back into a vector. 
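As a rough plain-Julia analogy (an illustration only, not a description of how DataFrames.jl implements `ByRow`), applying a function row by row resembles calling `map` over the column vectors. The vectors `a` and `b` below are stand-ins for the data frame columns used in this section.

```julia
# Stand-ins for columns `a` and `b` of the data frame used in this section
a = [1, 2, 3]
b = [4, 5, 4]

# Whole-vector call: `*` has no method for two vectors, so this throws a MethodError
whole_vector_failed = try
    a * b
    false
catch err
    err isa MethodError
end

# Element-wise call: roughly what wrapping `*` in `ByRow` does with the two columns
rowwise = map(*, a, b)

println(whole_vector_failed)
println(rowwise)
```

The element-wise result has one entry per row, which is why it can be stored as a new column.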
- -```julia -julia> transform(df, [:a, :b] => ByRow(*)) -3×3 DataFrame - Row │ a b a_b_* - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 4 4 - 2 │ 2 5 10 - 3 │ 3 4 12 - -julia> transform(df, Cols(:) => ByRow(max)) -3×3 DataFrame - Row │ a b a_b_max - │ Int64 Int64 Int64 -─────┼─────────────────────── - 1 │ 1 4 4 - 2 │ 2 5 5 - 3 │ 3 4 4 - -julia> f(x) = x + 1 -f (generic function with 1 method) - -julia> transform(df, :a => ByRow(f)) -3×3 DataFrame - Row │ a b a_f - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 4 2 - 2 │ 2 5 3 - 3 │ 3 4 4 -``` - -Alternatively, you may just want to define the function itself so it -[broadcasts](https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting) -over vectors. - -```julia -julia> g(x) = x .+ 1 -g (generic function with 1 method) - -julia> transform(df, :a => g) -3×3 DataFrame - Row │ a b a_g - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 4 2 - 2 │ 2 5 3 - 3 │ 3 4 4 - -julia> h(x, y) = x .+ y .+ 1 -h (generic function with 1 method) - -julia> transform(df, [:a, :b] => h) -3×3 DataFrame - Row │ a b a_b_h - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 4 6 - 2 │ 2 5 8 - 3 │ 3 4 8 -``` - -[Anonymous functions](https://docs.julialang.org/en/v1/manual/functions/#man-anonymous-functions) -are a convenient way to define and use an `operation_function` -all within the manipulation function call. 
- -```julia -julia> select(df, :a => ByRow(x -> x + 1)) -3×1 DataFrame - Row │ a_function - │ Int64 -─────┼──────────── - 1 │ 2 - 2 │ 3 - 3 │ 4 - -julia> transform(df, [:a, :b] => ByRow((x, y) -> 2x + y)) -3×3 DataFrame - Row │ a b a_b_function - │ Int64 Int64 Int64 -─────┼──────────────────────────── - 1 │ 1 4 6 - 2 │ 2 5 9 - 3 │ 3 4 10 - -julia> subset(df, :b => ByRow(x -> x < 5)) -2×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 4 - 2 │ 3 4 - -julia> subset(df, :b => ByRow(<(5))) # shorter version of the previous -2×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 4 - 2 │ 3 4 -``` - -!!! Note - `operation_functions` within `subset` or `subset!` function calls - must return a Boolean vector. - `true` elements in the Boolean vector will determine - which rows are retained in the resulting data frame. - -As demonstrated above, `DataFrame` columns are usually passed -from `source_column_selector` to `operation_function` as one or more -vector arguments. -However, when `AsTable(source_column_selector)` is used, -the selected columns are collected and passed as a single `NamedTuple` -to `operation_function`. - -This is often useful when your `operation_function` is defined to operate -on a single collection argument rather than on multiple positional arguments. -The distinction is somewhat similar to the difference between the built-in -`min` and `minimum` functions. -`min` is defined to find the minimum value among multiple positional arguments, -while `minimum` is defined to find the minimum value -among the elements of a single collection argument. 
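The same distinction can be checked in plain Julia. In this small sketch, the named tuple `row` is a hypothetical stand-in for the single collection that a row-wise function might receive when `AsTable` is used; its values are made up for illustration.

```julia
# Separate positional arguments, as when several columns are passed individually
positional = min(1, 3, 5, 2)

# One collection argument: a NamedTuple standing in for a single row (hypothetical values)
row = (a = 1, b = 3, c = 5, d = 2)
collected = minimum(row)

# Splatting turns the collection back into positional arguments
splatted = min(row...)

println((positional, collected, splatted))
```

All three calls find the same value; they differ only in how the arguments are delivered to the function.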
- -```julia -julia> df = DataFrame(a = 1:2, b = 3:4, c = 5:6, d = 2:-1:1) -2×4 DataFrame - Row │ a b c d - │ Int64 Int64 Int64 Int64 -─────┼──────────────────────────── - 1 │ 1 3 5 2 - 2 │ 2 4 6 1 - -julia> select(df, Cols(:) => ByRow(min)) # min operates on multiple arguments -2×1 DataFrame - Row │ a_b_etc_min - │ Int64 -─────┼───────────── - 1 │ 1 - 2 │ 1 - -julia> select(df, AsTable(:) => ByRow(minimum)) # minimum operates on a collection -2×1 DataFrame - Row │ a_b_etc_minimum - │ Int64 -─────┼───────────────── - 1 │ 1 - 2 │ 1 - -julia> select(df, [:a,:b] => ByRow(+)) # `+` operates on multiple arguments -2×1 DataFrame - Row │ a_b_+ - │ Int64 -─────┼─────── - 1 │ 4 - 2 │ 6 - -julia> select(df, AsTable([:a,:b]) => ByRow(sum)) # `sum` operates on a collection -2×1 DataFrame - Row │ a_b_sum - │ Int64 -─────┼───────── - 1 │ 4 - 2 │ 6 - -julia> using Statistics # contains the `mean` function - -julia> select(df, AsTable(Between(:b, :d)) => ByRow(mean)) # `mean` operates on a collection -2×1 DataFrame - Row │ b_c_d_mean - │ Float64 -─────┼──────────── - 1 │ 3.33333 - 2 │ 3.66667 -``` - -`AsTable` can also be used to pass columns to a function which operates -on fields of a `NamedTuple`. 
- -```julia -julia> df = DataFrame(a = 1:2, b = 3:4, c = 5:6, d = 7:8) -2×4 DataFrame - Row │ a b c d - │ Int64 Int64 Int64 Int64 -─────┼──────────────────────────── - 1 │ 1 3 5 7 - 2 │ 2 4 6 8 - -julia> f(nt) = nt.a + nt.d -f (generic function with 1 method) - -julia> transform(df, AsTable(:) => ByRow(f)) -2×5 DataFrame - Row │ a b c d a_b_etc_f - │ Int64 Int64 Int64 Int64 Int64 -─────┼─────────────────────────────────────── - 1 │ 1 3 5 7 8 - 2 │ 2 4 6 8 10 -``` - -As demonstrated above, -in the `source_column_selector => operation_function` operation pair form, -the results of an operation will be placed into a new column with an -automatically-generated name based on the operation; -the new column name will be the `operation_function` name -appended to the source column name(s) with an underscore. - -This automatic column naming behavior can be avoided in two ways. -First, the operation result can be placed back into the original column -with the original column name by switching the keyword argument `renamecols` -from its default value (`true`) to `renamecols=false`. -This option prevents the function name from being appended to the column name -as it usually would be. - -```julia -julia> df = DataFrame(a=1:4, b=5:8) -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 - -julia> transform(df, :a => ByRow(x->x+10), renamecols=false) # add 10 in-place -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 11 5 - 2 │ 12 6 - 3 │ 13 7 - 4 │ 14 8 -``` - -The second method to avoid the default manipulation column naming is to -specify your own `new_column_names`. - -### `new_column_names` - -`new_column_names` can be included at the end of an `operation` pair to specify -the name of the new column(s). -`new_column_names` may be a symbol, string, function, vector of symbols, vector of strings, or `AsTable`. 
- -```julia -julia> df = DataFrame(a=1:4, b=5:8) -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 - -julia> transform(df, Cols(:) => ByRow(+) => :c) -4×3 DataFrame - Row │ a b c - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 6 - 2 │ 2 6 8 - 3 │ 3 7 10 - 4 │ 4 8 12 - -julia> transform(df, Cols(:) => ByRow(+) => "a+b") -4×3 DataFrame - Row │ a b a+b - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 6 - 2 │ 2 6 8 - 3 │ 3 7 10 - 4 │ 4 8 12 - -julia> transform(df, :a => ByRow(x->x+10) => "a+10") -4×3 DataFrame - Row │ a b a+10 - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 11 - 2 │ 2 6 12 - 3 │ 3 7 13 - 4 │ 4 8 14 -``` - -The `source_column_selector => new_column_names` operation form -can be used to rename columns without an intermediate function. -However, there are `rename` and `rename!` functions, -which accept similar syntax, -that tend to be more useful for this operation. - -```julia -julia> df = DataFrame(a=1:4, b=5:8) -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 - -julia> transform(df, :a => :apple) # adds column `apple` -4×3 DataFrame - Row │ a b apple - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 1 - 2 │ 2 6 2 - 3 │ 3 7 3 - 4 │ 4 8 4 - -julia> select(df, :a => :apple) # retains only column `apple` -4×1 DataFrame - Row │ apple - │ Int64 -─────┼─────── - 1 │ 1 - 2 │ 2 - 3 │ 3 - 4 │ 4 - -julia> rename(df, :a => :apple) # renames column `a` to `apple` in-place -4×2 DataFrame - Row │ apple b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 -``` - -If `new_column_names` already exist in the source data frame, -those columns will be replaced in the existing column location -rather than being added to the end. -This can be done by manually specifying an existing column name -or by using the `renamecols=false` keyword argument. 
- -```julia -julia> df = DataFrame(a=1:4, b=5:8) -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 - -julia> transform(df, :b => (x -> x .+ 10)) # automatic new column and column name -4×3 DataFrame - Row │ a b b_function - │ Int64 Int64 Int64 -─────┼────────────────────────── - 1 │ 1 5 15 - 2 │ 2 6 16 - 3 │ 3 7 17 - 4 │ 4 8 18 - -julia> transform(df, :b => (x -> x .+ 10), renamecols=false) # transform column in-place -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 15 - 2 │ 2 16 - 3 │ 3 17 - 4 │ 4 18 - -julia> transform(df, :b => (x -> x .+ 10) => :a) # replace column :a -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 15 5 - 2 │ 16 6 - 3 │ 17 7 - 4 │ 18 8 -``` - -Actually, `renamecols=false` just prevents the function name from being appended to the final column name such that the operation is *usually* returned to the same column. - -```julia -julia> transform(df, [:a, :b] => +) # new column name is all source columns and function name -4×3 DataFrame - Row │ a b a_b_+ - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 6 - 2 │ 2 6 8 - 3 │ 3 7 10 - 4 │ 4 8 12 - -julia> transform(df, [:a, :b] => +, renamecols=false) # same as above but with no function name -4×3 DataFrame - Row │ a b a_b - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 6 - 2 │ 2 6 8 - 3 │ 3 7 10 - 4 │ 4 8 12 - -julia> transform(df, [:a, :b] => (+) => :a) # manually overwrite column :a (see Note below about parentheses) -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 6 5 - 2 │ 8 6 - 3 │ 10 7 - 4 │ 12 8 -``` - -In the `source_column_selector => operation_function => new_column_names` operation form, -`new_column_names` may also be a renaming function which operates on a string -to create the destination column names programmatically. 
- -```julia -julia> df = DataFrame(a=1:4, b=5:8) -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 - -julia> add_prefix(s) = "new_" * s -add_prefix (generic function with 1 method) - -julia> transform(df, :a => (x -> 10 .* x) => add_prefix) # with named renaming function -4×3 DataFrame - Row │ a b new_a - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 10 - 2 │ 2 6 20 - 3 │ 3 7 30 - 4 │ 4 8 40 - -julia> transform(df, :a => (x -> 10 .* x) => (s -> "new_" * s)) # with anonymous renaming function -4×3 DataFrame - Row │ a b new_a - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 10 - 2 │ 2 6 20 - 3 │ 3 7 30 - 4 │ 4 8 40 -``` - -!!! Note - It is a good idea to wrap anonymous functions in parentheses - to avoid the `=>` operator accidentally becoming part of the anonymous function. - The examples above do not work correctly without the parentheses! - ```julia - julia> transform(df, :a => x -> 10 .* x => add_prefix) # Not what we wanted! - 4×3 DataFrame - Row │ a b a_function - │ Int64 Int64 Pair… - ─────┼──────────────────────────────────────────── - 1 │ 1 5 [10, 20, 30, 40]=>add_prefix - 2 │ 2 6 [10, 20, 30, 40]=>add_prefix - 3 │ 3 7 [10, 20, 30, 40]=>add_prefix - 4 │ 4 8 [10, 20, 30, 40]=>add_prefix - - julia> transform(df, :a => x -> 10 .* x => s -> "new_" * s) # Not what we wanted! - 4×3 DataFrame - Row │ a b a_function - │ Int64 Int64 Pair… - ─────┼───────────────────────────────────── - 1 │ 1 5 [10, 20, 30, 40]=>#18 - 2 │ 2 6 [10, 20, 30, 40]=>#18 - 3 │ 3 7 [10, 20, 30, 40]=>#18 - 4 │ 4 8 [10, 20, 30, 40]=>#18 - ``` - -A renaming function will not work in the -`source_column_selector => new_column_names` operation form -because a function in the second element of the operation pair is assumed to take -the `source_column_selector => operation_function` operation form. 
-To work around this limitation, use the -`source_column_selector => operation_function => new_column_names` operation form -with `identity` as the `operation_function`. - -```julia -julia> transform(df, :a => add_prefix) -ERROR: MethodError: no method matching *(::String, ::Vector{Int64}) - -julia> transform(df, :a => identity => add_prefix) -4×3 DataFrame - Row │ a b new_a - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 1 - 2 │ 2 6 2 - 3 │ 3 7 3 - 4 │ 4 8 4 -``` - -In this case though, -it is probably again more useful to use the `rename` or `rename!` function -rather than one of the manipulation functions -in order to rename in-place and avoid the intermediate `operation_function`. -```julia -julia> rename(add_prefix, df) # rename all columns with a function -4×2 DataFrame - Row │ new_a new_b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 - -julia> rename(add_prefix, df; cols=:a) # rename some columns with a function -4×2 DataFrame - Row │ new_a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 -``` - -In the `source_column_selector => new_column_names` operation form, -only a single source column may be selected per operation, -so why is `new_column_names` plural? -It is possible to split the data contained inside a single column -into multiple new columns by supplying a vector of strings or symbols -as `new_column_names`. - -```julia -julia> df = DataFrame(data = [(1,2), (3,4)]) # vector of tuples -2×1 DataFrame - Row │ data - │ Tuple… -─────┼──────── - 1 │ (1, 2) - 2 │ (3, 4) - -julia> transform(df, :data => [:first, :second]) # manual naming -2×3 DataFrame - Row │ data first second - │ Tuple… Int64 Int64 -─────┼─────────────────────── - 1 │ (1, 2) 1 2 - 2 │ (3, 4) 3 4 -``` - -This kind of data splitting can even be done automatically with `AsTable`. 
- -```julia -julia> transform(df, :data => AsTable) # default automatic naming with tuples -2×3 DataFrame - Row │ data x1 x2 - │ Tuple… Int64 Int64 -─────┼────────────────────── - 1 │ (1, 2) 1 2 - 2 │ (3, 4) 3 4 -``` - -If a data frame column contains `NamedTuple`s, -then `AsTable` will preserve the field names. -```julia -julia> df = DataFrame(data = [(a=1,b=2), (a=3,b=4)]) # vector of named tuples -2×1 DataFrame - Row │ data - │ NamedTup… -─────┼──────────────── - 1 │ (a = 1, b = 2) - 2 │ (a = 3, b = 4) - -julia> transform(df, :data => AsTable) # keeps names from named tuples -2×3 DataFrame - Row │ data a b - │ NamedTup… Int64 Int64 -─────┼────────────────────────────── - 1 │ (a = 1, b = 2) 1 2 - 2 │ (a = 3, b = 4) 3 4 -``` - -!!! Note - To pack multiple columns into a single column of `NamedTuple`s - (reverse of the above operation) - apply the `identity` function `ByRow`, e.g. - `transform(df, AsTable([:a, :b]) => ByRow(identity) => :data)`. - -Renaming functions also work for multi-column transformations, -but they must operate on a vector of strings. 
- -```julia -julia> df = DataFrame(data = [(1,2), (3,4)]) -2×1 DataFrame - Row │ data - │ Tuple… -─────┼──────── - 1 │ (1, 2) - 2 │ (3, 4) - -julia> new_names(v) = ["primary ", "secondary "] .* v -new_names (generic function with 1 method) - -julia> transform(df, :data => identity => new_names) -2×3 DataFrame - Row │ data primary data secondary data - │ Tuple… Int64 Int64 -─────┼────────────────────────────────────── - 1 │ (1, 2) 1 2 - 2 │ (3, 4) 3 4 -``` - -## Applying Multiple Operations per Manipulation -All data frame manipulation functions can accept multiple `operation` pairs -at once using any of the following methods: -- `manipulation_function(dataframe, operation1, operation2)` : multiple arguments -- `manipulation_function(dataframe, [operation1, operation2])` : vector argument -- `manipulation_function(dataframe, [operation1 operation2])` : matrix argument - -Passing multiple operations is especially useful for the `select`, `select!`, -and `combine` manipulation functions, -since they only retain columns which are a result of the passed operations. 
- -```julia -julia> df = DataFrame(a = 1:4, b = [50,50,60,60], c = ["hat","bat","cat","dog"]) -4×3 DataFrame - Row │ a b c - │ Int64 Int64 String -─────┼────────────────────── - 1 │ 1 50 hat - 2 │ 2 50 bat - 3 │ 3 60 cat - 4 │ 4 60 dog - -julia> combine(df, :a => maximum, :b => sum, :c => join) # 3 combine operations -1×3 DataFrame - Row │ a_maximum b_sum c_join - │ Int64 Int64 String -─────┼──────────────────────────────── - 1 │ 4 220 hatbatcatdog - -julia> select(df, :c, :b, :a) # re-order columns -4×3 DataFrame - Row │ c b a - │ String Int64 Int64 -─────┼────────────────────── - 1 │ hat 50 1 - 2 │ bat 50 2 - 3 │ cat 60 3 - 4 │ dog 60 4 - -julia> select(df, :b, :) # `:` here means all other columns -4×3 DataFrame - Row │ b a c - │ Int64 Int64 String -─────┼────────────────────── - 1 │ 50 1 hat - 2 │ 50 2 bat - 3 │ 60 3 cat - 4 │ 60 4 dog - -julia> select( - df, - :c => (x -> "a " .* x) => :one_c, - :a => (x -> 100x), - :b, - renamecols=false - ) # can mix operation forms -4×3 DataFrame - Row │ one_c a b - │ String Int64 Int64 -─────┼────────────────────── - 1 │ a hat 100 50 - 2 │ a bat 200 50 - 3 │ a cat 300 60 - 4 │ a dog 400 60 - -julia> select( - df, - :c => ByRow(reverse), - :c => ByRow(uppercase) - ) # multiple operations on same column -4×2 DataFrame - Row │ c_reverse c_uppercase - │ String String -─────┼──────────────────────── - 1 │ tah HAT - 2 │ tab BAT - 3 │ tac CAT - 4 │ god DOG -``` - -In the last two examples, -the manipulation function arguments were split across multiple lines. -This is a good way to make manipulations with many operations more readable. - -Passing multiple operations to `subset` or `subset!` is an easy way to narrow in -on a particular row of data. 
- -```julia -julia> subset( - df, - :b => ByRow(==(60)), - :c => ByRow(contains("at")) - ) # rows with 60 and "at" -1×3 DataFrame - Row │ a b c - │ Int64 Int64 String -─────┼────────────────────── - 1 │ 3 60 cat -``` - -Note that all operations within a single manipulation must use the data -as it existed before the function call -i.e. you cannot use newly created columns for subsequent operations -within the same manipulation. - -```julia -julia> transform( - df, - [:a, :b] => ByRow(+) => :d, - :d => (x -> x ./ 2), - ) # requires two separate transformations -ERROR: ArgumentError: column name :d not found in the data frame; existing most similar names are: :a, :b and :c - -julia> new_df = transform(df, [:a, :b] => ByRow(+) => :d) -4×4 DataFrame - Row │ a b c d - │ Int64 Int64 String Int64 -─────┼───────────────────────────── - 1 │ 1 50 hat 51 - 2 │ 2 50 bat 52 - 3 │ 3 60 cat 63 - 4 │ 4 60 dog 64 - -julia> transform!(new_df, :d => (x -> x ./ 2) => :d_2) -4×5 DataFrame - Row │ a b c d d_2 - │ Int64 Int64 String Int64 Float64 -─────┼────────────────────────────────────── - 1 │ 1 50 hat 51 25.5 - 2 │ 2 50 bat 52 26.0 - 3 │ 3 60 cat 63 31.5 - 4 │ 4 60 dog 64 32.0 -``` - - -## Broadcasting Operation Pairs - -[Broadcasting](https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting) -pairs with `.=>` is often a convenient way to generate multiple -similar `operation`s to be applied within a single manipulation. -Broadcasting within the `Pair` of an `operation` is no different than -broadcasting in base Julia. -The broadcasting `.=>` will be expanded into a vector of pairs -(`[operation1, operation2, ...]`), -and this expansion will occur before the manipulation function is invoked. -Then the manipulation function will use the -`manipulation_function(dataframe, [operation1, operation2, ...])` method. -This process will be explained in more detail below. - -To illustrate these concepts, let us first examine the `Type` of a basic `Pair`. 
-In DataFrames.jl, a symbol, string, or integer -may be used to select a single column. -Some `Pair`s with these types are below. - -```julia -julia> typeof(:x => :a) -Pair{Symbol, Symbol} - -julia> typeof("x" => "a") -Pair{String, String} - -julia> typeof(1 => "a") -Pair{Int64, String} -``` - -Any of the `Pair`s above could be used to rename the first column -of the data frame below to `a`. - -```julia -julia> df = DataFrame(x = 1:3, y = 4:6) -3×2 DataFrame - Row │ x y - │ Int64 Int64 -─────┼────────────── - 1 │ 1 4 - 2 │ 2 5 - 3 │ 3 6 - -julia> select(df, :x => :a) -3×1 DataFrame - Row │ a - │ Int64 -─────┼─────── - 1 │ 1 - 2 │ 2 - 3 │ 3 - -julia> select(df, 1 => "a") -3×1 DataFrame - Row │ a - │ Int64 -─────┼─────── - 1 │ 1 - 2 │ 2 - 3 │ 3 -``` - -What should we do if we want to keep and rename both the `x` and `y` column? -One option is to supply a `Vector` of operation `Pair`s to `select`. -`select` will process all of these operations in order. - -```julia -julia> ["x" => "a", "y" => "b"] -2-element Vector{Pair{String, String}}: - "x" => "a" - "y" => "b" - -julia> select(df, ["x" => "a", "y" => "b"]) -3×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 4 - 2 │ 2 5 - 3 │ 3 6 -``` - -We can use broadcasting to simplify the syntax above. - -```julia -julia> ["x", "y"] .=> ["a", "b"] -2-element Vector{Pair{String, String}}: - "x" => "a" - "y" => "b" - -julia> select(df, ["x", "y"] .=> ["a", "b"]) -3×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 4 - 2 │ 2 5 - 3 │ 3 6 -``` - -Notice that `select` sees the same `Vector{Pair{String, String}}` operation -argument whether the individual pairs are written out explicitly or -constructed with broadcasting. -The broadcasting is applied before the call to `select`. - -```julia -julia> ["x" => "a", "y" => "b"] == (["x", "y"] .=> ["a", "b"]) -true -``` - -!!! Note - These operation pairs (or vector of pairs) can be given variable names. 
- This is uncommon in practice but could be helpful for intermediate - inspection and testing. - ```julia - df = DataFrame(x = 1:3, y = 4:6) # create data frame - operation = ["x", "y"] .=> ["a", "b"] # save operation to variable - typeof(operation) # check type of operation - first(operation) # check first pair in operation - last(operation) # check last pair in operation - select(df, operation) # manipulate `df` with `operation` - ``` - -In Julia, -a non-vector broadcasted with a vector will be repeated in each resultant pair element. - -```julia -julia> ["x", "y"] .=> :a # :a is repeated -2-element Vector{Pair{String, Symbol}}: - "x" => :a - "y" => :a - -julia> 1 .=> [:a, :b] # 1 is repeated -2-element Vector{Pair{Int64, Symbol}}: - 1 => :a - 1 => :b -``` - -We can use this fact to easily broadcast an `operation_function` to multiple columns. - -```julia -julia> f(x) = 2 * x -f (generic function with 1 method) - -julia> ["x", "y"] .=> f # f is repeated -2-element Vector{Pair{String, typeof(f)}}: - "x" => f - "y" => f - -julia> select(df, ["x", "y"] .=> f) # apply f with automatic column renaming -3×2 DataFrame - Row │ x_f y_f - │ Int64 Int64 -─────┼────────────── - 1 │ 2 8 - 2 │ 4 10 - 3 │ 6 12 - -julia> ["x", "y"] .=> f .=> ["a", "b"] # f is repeated -2-element Vector{Pair{String, Pair{typeof(f), String}}}: - "x" => (f => "a") - "y" => (f => "b") - -julia> select(df, ["x", "y"] .=> f .=> ["a", "b"]) # apply f with manual column renaming -3×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 2 8 - 2 │ 4 10 - 3 │ 6 12 -``` - -A renaming function can be applied to multiple columns in the same way. -It will also be repeated in each operation `Pair`. 
- -```julia -julia> newname(s::String) = s * "_new" -newname (generic function with 1 method) - -julia> ["x", "y"] .=> f .=> newname # both f and newname are repeated -2-element Vector{Pair{String, Pair{typeof(f), typeof(newname)}}}: - "x" => (f => newname) - "y" => (f => newname) - -julia> select(df, ["x", "y"] .=> f .=> newname) # apply f then rename column with newname -3×2 DataFrame - Row │ x_new y_new - │ Int64 Int64 -─────┼────────────── - 1 │ 2 8 - 2 │ 4 10 - 3 │ 6 12 -``` - -You can see from the type output above -that a three element pair does not actually exist. -A `Pair` (as the name implies) can only contain two elements. -Thus, `:x => :y => :z` becomes a nested `Pair`, -where `:x` is the first element and points to the `Pair` `:y => :z`, -which is the second element. - -```julia -julia> p = :x => :y => :z -:x => (:y => :z) - -julia> p[1] -:x - -julia> p[2] -:y => :z - -julia> p[2][1] -:y - -julia> p[2][2] -:z - -julia> p[3] # there is no index 3 for a pair -ERROR: BoundsError: attempt to access Pair{Symbol, Pair{Symbol, Symbol}} at index [3] -``` - -In the previous examples, the source columns have been individually selected. -When broadcasting multiple columns to the same function, -often similarities in the column names or position can be exploited to avoid -tedious selection. -Consider a data frame with temperature data at three different locations -taken over time. -```julia -julia> df = DataFrame(Time = 1:4, - Temperature1 = [20, 23, 25, 28], - Temperature2 = [33, 37, 41, 44], - Temperature3 = [15, 10, 4, 0]) -4×4 DataFrame - Row │ Time Temperature1 Temperature2 Temperature3 - │ Int64 Int64 Int64 Int64 -─────┼───────────────────────────────────────────────── - 1 │ 1 20 33 15 - 2 │ 2 23 37 10 - 3 │ 3 25 41 4 - 4 │ 4 28 44 0 -``` - -To convert all of the temperature data in one transformation, -we just need to define a conversion function and broadcast -it to all of the "Temperature" columns. 
- -```julia -julia> celsius_to_kelvin(x) = x + 273 -celsius_to_kelvin (generic function with 1 method) - -julia> transform( - df, - Cols(r"Temp") .=> ByRow(celsius_to_kelvin), - renamecols = false - ) -4×4 DataFrame - Row │ Time Temperature1 Temperature2 Temperature3 - │ Int64 Int64 Int64 Int64 -─────┼───────────────────────────────────────────────── - 1 │ 1 293 306 288 - 2 │ 2 296 310 283 - 3 │ 3 298 314 277 - 4 │ 4 301 317 273 -``` -Or, simultaneously changing the column names: - -```julia -julia> rename_function(s) = "Temperature $(last(s)) (K)" -rename_function (generic function with 1 method) - -julia> select( - df, - "Time", - Cols(r"Temp") .=> ByRow(celsius_to_kelvin) .=> rename_function - ) -4×4 DataFrame - Row │ Time Temperature 1 (K) Temperature 2 (K) Temperature 3 (K) - │ Int64 Int64 Int64 Int64 -─────┼──────────────────────────────────────────────────────────────── - 1 │ 1 293 306 288 - 2 │ 2 296 310 283 - 3 │ 3 298 314 277 - 4 │ 4 301 317 273 -``` - -!!! Note Notes - * `Not("Time")` or `2:4` would have been equally good choices for `source_column_selector` in the above operations. - * Don't forget `ByRow` if your function is to be applied to elements rather than entire column vectors. - Without `ByRow`, the manipulations above would have thrown - `ERROR: MethodError: no method matching +(::Vector{Int64}, ::Int64)`. - * Regular expression (`r""`) and `:` `source_column_selectors` - must be wrapped in `Cols` to be properly broadcasted - because otherwise the broadcasting occurs before the expression is expanded into a vector of matches. - -You could also broadcast different columns to different functions -by supplying a vector of functions. 
- -```julia -julia> df = DataFrame(a=1:4, b=5:8) -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 - -julia> f1(x) = x .+ 1 -f1 (generic function with 1 method) - -julia> f2(x) = x ./ 10 -f2 (generic function with 1 method) - -julia> transform(df, [:a, :b] .=> [f1, f2]) -4×4 DataFrame - Row │ a b a_f1 b_f2 - │ Int64 Int64 Int64 Float64 -─────┼────────────────────────────── - 1 │ 1 5 2 0.5 - 2 │ 2 6 3 0.6 - 3 │ 3 7 4 0.7 - 4 │ 4 8 5 0.8 -``` - -However, this form is not much more convenient than supplying -multiple individual operations. - -```julia -julia> transform(df, [:a => f1, :b => f2]) # same manipulation as previous -4×4 DataFrame - Row │ a b a_f1 b_f2 - │ Int64 Int64 Int64 Float64 -─────┼────────────────────────────── - 1 │ 1 5 2 0.5 - 2 │ 2 6 3 0.6 - 3 │ 3 7 4 0.7 - 4 │ 4 8 5 0.8 -``` - -Perhaps more useful for broadcasting syntax -is to apply multiple functions to multiple columns -by changing the vector of functions to a 1-by-x matrix of functions. -(Recall that a list, a vector, or a matrix of operation pairs are all valid -for passing to the manipulation functions.) - -```julia -julia> [:a, :b] .=> [f1 f2] # No comma `,` between f1 and f2 -2×2 Matrix{Pair{Symbol}}: - :a=>f1 :a=>f2 - :b=>f1 :b=>f2 - -julia> transform(df, [:a, :b] .=> [f1 f2]) # No comma `,` between f1 and f2 -4×6 DataFrame - Row │ a b a_f1 b_f1 a_f2 b_f2 - │ Int64 Int64 Int64 Int64 Float64 Float64 -─────┼────────────────────────────────────────────── - 1 │ 1 5 2 6 0.1 0.5 - 2 │ 2 6 3 7 0.2 0.6 - 3 │ 3 7 4 8 0.3 0.7 - 4 │ 4 8 5 9 0.4 0.8 -``` - -In this way, every combination of selected columns and functions will be applied. - -Pair broadcasting is a simple but powerful tool -that can be used in any of the manipulation functions listed under -[Basic Usage of Manipulation Functions](@ref). -Experiment for yourself to discover other useful operations. 
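One way to experiment is to build the broadcasted operations in plain Julia and inspect them before handing them to a manipulation function. The sketch below uses only Base Julia; `operation_parts`, `double`, and `newname` are hypothetical names introduced for illustration and are not part of DataFrames.jl. The helper assumes that a non-pair second element is an operation function, so it does not distinguish the `source_column_selector => new_column_names` form.

```julia
# Hypothetical helper (not part of DataFrames.jl): name the parts of one operation pair.
# Assumes a non-pair second element is the operation function.
function operation_parts(op::Pair)
    if op.second isa Pair
        return (source = op.first, fun = op.second.first, dest = op.second.second)
    else
        return (source = op.first, fun = op.second, dest = nothing)
    end
end

double(x) = 2 .* x        # example operation function
newname(s) = s * "_new"   # example renaming function

# Broadcasting expands into a vector of nested pairs before any manipulation function runs
ops = ["x", "y"] .=> double .=> newname

for op in ops
    parts = operation_parts(op)
    println(parts.source, " -> ", parts.fun, " -> ", parts.dest)
end

# A row matrix of functions pairs every column with every function
grid = [:a, :b] .=> [double sqrt]
println(size(grid))
```

Inspecting `ops` or `grid` this way shows exactly which operations a manipulation function would receive.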
- -## Additional Resources -More details and examples of operation pair syntax can be found in -[this blog post](https://bkamins.github.io/julialang/2020/12/24/minilanguage.html). -(The official wording describing the syntax has changed since the blog post was written, -but the examples are still illustrative. -The operation pair syntax is sometimes referred to as the DataFrames.jl mini-language -or Domain-Specific Language.) - -For additional practice, -an interactive tutorial is provided on a variety of introductory topics -by the DataFrames.jl package author -[here](https://github.com/bkamins/Julia-DataFrames-Tutorial). - - -For additional syntax niceties, -many users find the [Chain.jl](https://github.com/jkrumbiegel/Chain.jl) -and [DataFramesMeta.jl](https://github.com/JuliaData/DataFramesMeta.jl) -packages useful -to help simplify manipulations that may be tedious with operation pairs alone. \ No newline at end of file