Add TidierData to frameworks docs page #3447

Merged
merged 6 commits into from
Sep 7, 2024
Changes from 5 commits
1 change: 1 addition & 0 deletions docs/Project.toml
@@ -9,6 +9,7 @@ Missings = "e1d29d7a-bbdc-5cf2-9ac0-f12de2c33e28"
Query = "1a8c2f83-1ff3-5112-b086-8aa67b057ba1"
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
TidierData = "fe2206b3-d496-4ee9-a338-6a095c4ece80"

[compat]
Documenter = "1"
139 changes: 139 additions & 0 deletions docs/src/man/querying_frameworks.md
@@ -8,6 +8,145 @@ DataFramesMeta.jl, DataFrameMacros.jl and Query.jl. They implement a functionali
These frameworks are designed both to make it easier for new users to start working with data frames in Julia
and to allow advanced users to write more compact code.

## TidierData.jl
[TidierData.jl](https://tidierorg.github.io/TidierData.jl/latest/), part of
the [Tidier](https://tidierorg.github.io/Tidier.jl/dev/) ecosystem, is a macro-based
data analysis interface that wraps `DataFrames`. The instructions below are for version
0.16.0 of TidierData.jl.

First, install the TidierData.jl package:

```julia
using Pkg
Pkg.add("TidierData")
```

TidierData.jl enables clean, readable, and fast code for all major data transformation
functions including
[aggregating](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/summarize/),
[pivoting](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/pivots/),
[nesting](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/nesting/),
and [joining](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/joins/)
data frames. TidierData re-exports `DataFrame()` from DataFrames.jl, `@chain` from Chain.jl, and
Statistics.jl to streamline data operations.

TidierData.jl is heavily inspired by the `dplyr` and `tidyr` R packages (part of the R
`tidyverse`), which it aims to implement using pure Julia by wrapping DataFrames.jl. While
TidierData.jl borrows conventions from the `tidyverse`, it is important to note that the
`tidyverse` itself is often not considered idiomatic R code. TidierData.jl brings
data analysis conventions from `tidyverse` into Julia to have the best of both worlds:
tidy syntax and the speed and flexibility of the Julia language.

TidierData.jl has two major differences from other macro-based packages. First, TidierData.jl
uses tidy expressions. An example of a tidy expression is `a = mean(b)`, where `b` refers
to an existing column in the data frame, and `a` refers to either a new or existing column.
Referring to variables outside of the data frame requires prefixing variables with `!!`.
For example, `a = mean(!!b)` refers to a variable `b` outside the data frame. Second,
TidierData.jl aims to make broadcasting mostly invisible through
[auto-vectorization](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/autovec/). TidierData.jl currently uses a lookup table to decide which functions not to
vectorize; all other functions are automatically vectorized. This allows
concise expressions: `@mutate(df, a = a - mean(a))` transforms the `a` column
by subtracting the mean of the column from each value. Behind the scenes, the right-hand
expression is converted to `a .- mean(a)` because `mean()` is in the lookup table as a
function that should not be vectorized. Take a look at the
[auto-vectorization](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/autovec/) documentation for details.
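
To make the rewrite concrete, the sketch below reproduces in plain Julia what the broadcasted
expression `a .- mean(a)` computes, and contrasts it with what would happen if `mean` were
also vectorized. It uses only the Statistics standard library; the vector `a` and the variable
names are illustrative and are not taken from TidierData.jl.

```julia
using Statistics

# Illustrative column values (an assumption for this sketch).
a = [1.0, 2.0, 3.0, 6.0]

# `mean` reduces the whole column to a single number, so it is called once
# on the full vector; only the subtraction is broadcast with `.-`.
centered = a .- mean(a)

# If `mean` were broadcast as well, it would be applied to each scalar,
# returning that scalar, and the subtraction would yield all zeros.
wrongly_vectorized = a .- mean.(a)

println(centered)            # [-2.0, -1.0, 0.0, 3.0]
println(wrongly_vectorized)  # [0.0, 0.0, 0.0, 0.0]
```

This is why reducing functions such as `mean()` need to be excluded from vectorization.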

One major benefit of combining tidy expressions with auto-vectorization is that
TidierData.jl code (which uses DataFrames.jl as its backend) can work directly on
databases using [TidierDB.jl](https://github.com/TidierOrg/TidierDB.jl),
which converts tidy expressions into SQL, supporting DuckDB and several other backends.

```jldoctest tidierdata
julia> using TidierData

julia> df = DataFrame(
           name = ["John", "Sally", "Roger"],
           age = [54.0, 34.0, 79.0],
           children = [0, 2, 4]
       )
3×3 DataFrame
 Row │ name    age      children
     │ String  Float64  Int64
─────┼───────────────────────────
   1 │ John       54.0         0
   2 │ Sally      34.0         2
   3 │ Roger      79.0         4

julia> @chain df begin
           @filter(children != 2)
           @select(name, num_children = children)
       end
2×2 DataFrame
 Row │ name    num_children
     │ String  Int64
─────┼──────────────────────
   1 │ John               0
   2 │ Roger              4
```

Below are examples showcasing `@group_by` with `@summarize` or `@mutate`, analogous to the split-apply-combine pattern.

```jldoctest tidierdata
julia> df = DataFrame(
           groups = repeat('a':'e', inner = 2),
           b_col = 1:10,
           c_col = 11:20,
           d_col = 111:120
       )
10×4 DataFrame
 Row │ groups  b_col  c_col  d_col
     │ Char    Int64  Int64  Int64
─────┼─────────────────────────────
   1 │ a           1     11    111
   2 │ a           2     12    112
   3 │ b           3     13    113
   4 │ b           4     14    114
   5 │ c           5     15    115
   6 │ c           6     16    116
   7 │ d           7     17    117
   8 │ d           8     18    118
   9 │ e           9     19    119
  10 │ e          10     20    120

julia> @chain df begin
           @filter(b_col > 2)
           @group_by(groups)
           @summarise(median_b = median(b_col),
                      across((b_col:d_col), mean))
       end
4×5 DataFrame
 Row │ groups  median_b  b_col_mean  c_col_mean  d_col_mean
     │ Char    Float64   Float64     Float64     Float64
─────┼──────────────────────────────────────────────────────
   1 │ b            3.5         3.5        13.5       113.5
   2 │ c            5.5         5.5        15.5       115.5
   3 │ d            7.5         7.5        17.5       117.5
   4 │ e            9.5         9.5        19.5       119.5

julia> @chain df begin
           @filter(b_col > 4 && c_col <= 18)
           @group_by(groups)
           @mutate(
               new_col = b_col + maximum(d_col),
               new_col2 = c_col - maximum(d_col),
               new_col3 = case_when(c_col >= 18 => "high",
                                    c_col > 15 => "medium",
                                    true => "low"))
           @select(starts_with("new"))
           @ungroup # required because `@mutate` does not ungroup
       end
4×4 DataFrame
 Row │ groups  new_col  new_col2  new_col3
     │ Char    Int64    Int64     String
─────┼─────────────────────────────────────
   1 │ c           121      -101  low
   2 │ c           122      -100  medium
   3 │ d           125      -101  medium
   4 │ d           126      -100  high
```
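
For comparison, the `@group_by` and `@summarise` pipeline above corresponds roughly to the
following split-apply-combine code written directly against DataFrames.jl. This is a sketch
for orientation only: it is not how TidierData.jl is implemented, and the intermediate names
`gdf` and `result` are introduced here for illustration.

```julia
using DataFrames, Statistics

df = DataFrame(
    groups = repeat('a':'e', inner = 2),
    b_col = 1:10,
    c_col = 11:20,
    d_col = 111:120
)

# Split: keep rows with b_col > 2, then group the remaining rows by `groups`.
gdf = groupby(subset(df, :b_col => ByRow(>(2))), :groups)

# Apply and combine: one summary row per group. Broadcasting `.=> mean` over the
# column names yields the auto-generated names b_col_mean, c_col_mean, d_col_mean.
result = combine(gdf,
    :b_col => median => :median_b,
    [:b_col, :c_col, :d_col] .=> mean)
```

The TidierData.jl version expresses the same steps with tidy expressions instead of
`source => function => target` pairs.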

For more examples, please visit the [TidierData.jl](https://tidierorg.github.io/TidierData.jl/latest/) documentation.

## DataFramesMeta.jl

The [DataFramesMeta.jl](https://github.com/JuliaStats/DataFramesMeta.jl) package