diff --git a/docs/make.jl b/docs/make.jl
index c35d55b0b..d854981e2 100644
--- a/docs/make.jl
+++ b/docs/make.jl
@@ -26,7 +26,6 @@ makedocs(
             "Working with DataFrames" => "man/working_with_dataframes.md",
             "Importing and Exporting Data (I/O)" => "man/importing_and_exporting.md",
             "Joins" => "man/joins.md",
-            "Data Frame Manipulation Functions" => "man/manipulation_functions.md",
             "Split-apply-combine" => "man/split_apply_combine.md",
             "Reshaping" => "man/reshaping_and_pivoting.md",
             "Sorting" => "man/sorting.md",
@@ -35,6 +34,7 @@ makedocs(
             "Data manipulation frameworks" => "man/querying_frameworks.md",
             "Comparison with Python/R/Stata" => "man/comparisons.md"
         ],
+        "A Gentle Introduction to Data Frame Manipulation Functions" => "man/manipulation_functions.md",
         "API" => Any[
             "Types" => "lib/types.md",
             "Functions" => "lib/functions.md",
diff --git a/docs/src/index.md b/docs/src/index.md
index ea8697e9b..78c9ecd92 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -218,7 +218,6 @@ page](https://github.com/JuliaData/DataFrames.jl/releases).
 Pages = ["man/basics.md",
          "man/getting_started.md",
          "man/joins.md",
-         "man/manipulation_functions.md",
          "man/split_apply_combine.md",
          "man/reshaping_and_pivoting.md",
          "man/sorting.md",
@@ -229,6 +228,13 @@ Pages = ["man/basics.md",
 Depth = 2
 ```

+## A Gentle Introduction to Data Frame Manipulation Functions
+
+```@contents
+Pages = ["man/manipulation_functions.md"]
+Depth = 1
+```
+
 ## API

 Only exported (i.e. available for use without `DataFrames.` qualifier after
diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md
index 5980083d0..7f77d555b 100644
--- a/docs/src/man/basics.md
+++ b/docs/src/man/basics.md
@@ -1586,9 +1586,14 @@ which can be used to perform operations on data frame columns:
   as the source data frame, but with only the rows where the passed operations are true;
 - `subset!`: the same as `subset` but updates the passed data frame in place;

-These functions and their methods are explained in more detail in the section
-[Data Frame Manipulation Functions](@ref).
-In this section, we will move straight to examples using the German dataset.
+!!! note "Other Resources"
+    * For formal, comprehensive explanations of all manipulation functions,
+      see the [Functions](@ref) API.
+
+    * For an informal, long-form tutorial on these functions,
+      see [A Gentle Introduction to Data Frame Manipulation Functions](@ref).
+
+Let us now move straight to examples using the German dataset.

 ```jldoctest dataframe
 julia> using Statistics
@@ -2153,7 +2158,7 @@ julia> select(german, :Age, :Job, [:Age, :Job] => (+) => :res)

 This concludes the introductory examples of data frame manipulations.
 See later sections of the manual,
-particularly [Data Frame Manipulation Functions](@ref),
+particularly [A Gentle Introduction to Data Frame Manipulation Functions](@ref),
 for additional explanations and functionality,
 including how to broadcast operation functions and operation pairs
 and how to pass or produce multiple columns using `AsTable`.
diff --git a/docs/src/man/manipulation_functions.md b/docs/src/man/manipulation_functions.md
index da4fb1e63..db62e7adb 100644
--- a/docs/src/man/manipulation_functions.md
+++ b/docs/src/man/manipulation_functions.md
@@ -1,7 +1,10 @@
-# Data Frame Manipulation Functions
+# A Gentle Introduction to Data Frame Manipulation Functions

 The seven functions below can be used to manipulate data frames
 by applying operations to them.
+This section of the documentation aims to methodically build understanding
+of these functions and their possible arguments
+by reinforcing foundational concepts and slowly increasing complexity.

 The functions without a `!` in their name
 will create a new data frame based on the source data frame,
@@ -68,11 +71,11 @@ which is a type to link one object to another.
 [Dictionary](https://docs.julialang.org/en/v1/base/collections/#Dictionaries).)
 In DataFrames.jl manipulation functions,
 `Pair` arguments are used to define column `operations` to be performed.
-The provided examples will be explained in more detail below.
+The examples shown above will be explained in more detail later.

-The manipulation functions also have methods for applying multiple operations.
+*The manipulation functions also have methods for applying multiple operations.
 See the later sections [Multiple Operations per Manipulation](@ref)
-and [Broadcasting Operation Pairs](@ref) for more information.
+and [Broadcasting Operation Pairs](@ref) for more information.*

 ### `source_column_selector`
 Inside an `operation`, `source_column_selector` is usually a column name
@@ -494,6 +497,8 @@ This automatic column naming behavior can be avoided in two ways.
 First, the operation result can be placed back into the original column
 with the original column name by switching the keyword argument `renamecols`
 from its default value (`true`) to `renamecols=false`.
+This option prevents the function name from being appended to the column name
+as it usually would be.
 
 ```julia
 julia> df = DataFrame(a=1:4, b=5:8)
@@ -616,9 +621,90 @@ julia> rename(df, :a => :apple) # renames column `a` to `apple` in-place
    4 │     4      8
 ```

-Additionally, in the
-`source_column_selector => operation_function => new_column_names` operation form,
-`new_column_names` may be a renaming function which operates on a string
+If `new_column_names` already exist in the source data frame,
+those columns will be replaced in the existing column location
+rather than being added to the end.
+This can be done by manually specifying an existing column name
+or by using the `renamecols=false` keyword argument.
+
+```julia
+julia> df = DataFrame(a=1:4, b=5:8)
+4×2 DataFrame
+ Row │ a      b
+     │ Int64  Int64
+─────┼──────────────
+   1 │     1      5
+   2 │     2      6
+   3 │     3      7
+   4 │     4      8
+
+julia> transform(df, :b => (x -> x .+ 10)) # automatic new column and column name
+4×3 DataFrame
+ Row │ a      b      b_function
+     │ Int64  Int64  Int64
+─────┼──────────────────────────
+   1 │     1      5          15
+   2 │     2      6          16
+   3 │     3      7          17
+   4 │     4      8          18
+
+julia> transform(df, :b => (x -> x .+ 10), renamecols=false) # transform column in-place
+4×2 DataFrame
+ Row │ a      b
+     │ Int64  Int64
+─────┼──────────────
+   1 │     1     15
+   2 │     2     16
+   3 │     3     17
+   4 │     4     18
+
+julia> transform(df, :b => (x -> x .+ 10) => :a) # replace column :a
+4×2 DataFrame
+ Row │ a      b
+     │ Int64  Int64
+─────┼──────────────
+   1 │    15      5
+   2 │    16      6
+   3 │    17      7
+   4 │    18      8
+```
+
+More precisely, `renamecols=false` only prevents the function name from being appended to the generated column name, so the result is *usually*, but not always, stored back in the source column.
+
+```julia
+julia> transform(df, [:a, :b] => +) # new column name is all source columns and function name
+4×3 DataFrame
+ Row │ a      b      a_b_+
+     │ Int64  Int64  Int64
+─────┼─────────────────────
+   1 │     1      5      6
+   2 │     2      6      8
+   3 │     3      7     10
+   4 │     4      8     12
+
+julia> transform(df, [:a, :b] => +, renamecols=false) # same as above but with no function name
+4×3 DataFrame
+ Row │ a      b      a_b
+     │ Int64  Int64  Int64
+─────┼─────────────────────
+   1 │     1      5      6
+   2 │     2      6      8
+   3 │     3      7     10
+   4 │     4      8     12
+
+julia> transform(df, [:a, :b] => (+) => :a) # manually overwrite column :a (see Note below about parentheses)
+4×2 DataFrame
+ Row │ a      b
+     │ Int64  Int64
+─────┼──────────────
+   1 │     6      5
+   2 │     8      6
+   3 │    10      7
+   4 │    12      8
+```
+
+In the `source_column_selector => operation_function => new_column_names` operation form,
+`new_column_names` may also be a renaming function which operates on a string
 to create the destination column names programmatically.

 ```julia