# A Tidyverse Primer {#tidyverse}
```{r tidyverse-setup, include = FALSE}
knitr::opts_chunk$set(fig.path = "figures/")
library(tidyverse)
library(lubridate)
```
What is the tidyverse, and where does the tidymodels framework fit in? The tidyverse is a collection of R packages for data analysis that are developed with common ideas and norms. From @tidyverse:
> "At a high level, the tidyverse is a language for solving data science challenges with R code. Its primary goal is to facilitate a conversation between a human and a computer about data. Less abstractly, the tidyverse is a collection of R packages that share a high-level design philosophy and low-level grammar and data structures, so that learning one package makes it easier to learn the next."
In this chapter, we briefly discuss important principles of the tidyverse design philosophy and how they apply in the context of modeling software that is easy to use properly and supports good statistical practice, as we outlined in Chapter \@ref(software-modeling). The next chapter covers modeling conventions from the core R language. Taken together, these discussions can help you understand the relationships between the tidyverse, tidymodels, and the core or base R language. Both tidymodels and the tidyverse build on the R language, and tidymodels applies tidyverse principles to building models.
## Tidyverse Principles
The full set of strategies and tactics for writing R code in the tidyverse style can be found at the website <https://design.tidyverse.org>. Here we can briefly describe several of the general tidyverse design principles, their motivation, and how we think about modeling as an application of these principles.
### Design for humans
The tidyverse focuses on designing R packages and functions that can be easily understood and used by a broad range of people. Both historically and today, a substantial percentage of R users are not people who create software or tools but instead people who create analyses or models. As such, R users do not typically have (or need) computer science backgrounds, and many are not interested in writing their own R packages.
For this reason, it is critical that R code be easy to work with to accomplish your goals. Documentation, training, accessibility, and other factors play an important part in achieving this. However, if the syntax itself is difficult for people to easily comprehend, documentation is a poor solution. The software itself must be intuitive.
To contrast the tidyverse approach with more traditional R semantics, consider sorting a data frame. Data frames can represent different types of data in each column, and multiple values in each row. Using only the core language, we can sort a data frame using one or more columns by reordering the rows via R's subscripting rules in conjunction with `order()`; the function whose name makes it the tempting choice, `sort()`, does not work for this task. To sort the `mtcars` data by two of its columns, the call might look like:
```{r tidyverse-base-sort, eval = FALSE}
mtcars[order(mtcars$gear, mtcars$mpg), ]
```
While very computationally efficient, it would be difficult to argue that this is an intuitive user interface. In `r pkg(dplyr)` by contrast, the tidyverse function `arrange()` takes a set of variable names as input arguments directly:
```{r tidyverse-dplyr-sort, eval = FALSE}
library(dplyr)
arrange(.data = mtcars, gear, mpg)
```
:::rmdnote
The variable names used here are "unquoted"; many traditional R functions require a character string to specify variables, but tidyverse functions take unquoted names or _selector functions_. The selectors allow for one or more readable rules that are applied to the column names. For example, `ends_with("t")` would select the `drat` and `wt` columns of the `mtcars` data frame.
:::
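As a brief sketch of the selector described in the note, the chunk below applies `ends_with("t")` to `mtcars` and then combines it with an unquoted name. The object names `t_cols` and `mpg_and_t` are ours and are used only for illustration.

```{r tidyverse-selector-sketch}
library(dplyr)

# Keep only the columns whose names end in "t"
t_cols <- dplyr::select(mtcars, ends_with("t"))
names(t_cols)

# Selectors can be mixed with unquoted column names
mpg_and_t <- dplyr::select(mtcars, mpg, ends_with("t"))
names(mpg_and_t)
```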
Additionally, naming is crucial. If you were new to R and were writing data analysis or modeling code involving linear algebra, you might be stymied when searching for a function that computes the matrix inverse. Using `apropos("inv")` yields no obvious candidates. It turns out that the base R function for this task is `solve()`, for solving systems of linear equations. For a matrix `X`, you would use `solve(X)` to invert `X` (with no vector for the right-hand side of the equation). This is only documented in the description of one of the _arguments_ in the help file. In essence, you need to know the name of the solution to be able to find the solution.
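To make the `solve()` behavior concrete, here is a minimal sketch with a small matrix whose values are made up for illustration. Calling `solve()` without a right-hand side returns the inverse, which we check by multiplying it back against `X`.

```{r tidyverse-solve-sketch}
# A small invertible matrix (illustrative values only)
X <- matrix(c(2, 1, 1, 3), nrow = 2)

# With no right-hand side, solve() returns the inverse of X
X_inv <- solve(X)
X_inv

# Multiplying X by its inverse recovers the identity matrix, up to rounding
all.equal(X %*% X_inv, diag(2))
```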
The tidyverse approach is to use function names that are descriptive and explicit over those that are short and implicit. There is a focus on verbs (e.g., `fit`, `arrange`, etc.) for general methods. Verb-noun pairs are particularly effective; consider `invert_matrix()` as a hypothetical function name. In the context of modeling, it is also important to avoid highly technical jargon, such as Greek letters or obscure terms. Names should be as self-documenting as possible.
When there are similar functions in a package, function names are designed to be optimized for tab-completion. For example, the `r pkg(glue)` package has a collection of functions starting with a common prefix (`glue_`) that enables users to quickly find the function they are looking for.
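One way to see the effect of a shared prefix, outside of an editor's tab-completion, is to list the matching exports of a package. The sketch below assumes the `r pkg(glue)` package is installed; `glue_funs` is a name we chose for illustration.

```{r tidyverse-prefix-sketch}
# List the functions exported by glue that start with the `glue_` prefix
glue_funs <- grep("^glue_", getNamespaceExports("glue"), value = TRUE)
sort(glue_funs)
```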
### Reuse existing data structures
Whenever possible, functions should avoid returning a novel data structure. If the results are conducive to an existing data structure, it should be used. This reduces the cognitive load when using software; no additional syntax or methods are required.
The data frame is the preferred data structure in tidyverse and tidymodels packages, because its structure is a good fit for such a broad swath of data science tasks. Specifically, the tidyverse and tidymodels favor the tibble, a modern reimagining of R's data frame that we describe in the next section on example tidyverse syntax.
As an example, the `r pkg(rsample)` package can be used to create _resamples_ of a data set, such as cross-validation or the bootstrap (described in Chapter \@ref(resampling)). The resampling functions return a tibble with a column called `splits` of objects that define the resampled data sets. Three bootstrap samples of a data set might look like:
```{r tidyverse-resample}
boot_samp <- rsample::bootstraps(mtcars, times = 3)
boot_samp
class(boot_samp)
```
With this approach, vector-based functions can be used with these columns, such as `vapply()` or `purrr::map()`.^[If you've never seen `::` in R code before, it is an explicit method for calling a function. The value of the left-hand side is the _namespace_ where the function lives (usually a package name). The right-hand side is the function name. In cases where two packages use the same function name, this syntax ensures that the correct function is called.] This `boot_samp` object has multiple classes but inherits methods for data frames (`"data.frame"`) and tibbles (`"tbl_df"`). Additionally, new columns can be added to the results without affecting the class of the data. This is much easier and more versatile for users to work with than a completely new object type that does not make its data structure obvious.
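As a sketch of what this looks like in practice, the chunk below maps over the `splits` column with both `vapply()` and `purrr::map_dbl()`, then adds the results as a new column. It assumes `rsample::analysis()` is used to extract the resampled rows from each split; the seed and the names `boot_sizes`, `boot_means`, and `boot_res` are arbitrary choices for illustration.

```{r tidyverse-resample-map-sketch}
library(purrr)

set.seed(1)  # arbitrary seed so that the resamples are reproducible
boot_samp <- rsample::bootstraps(mtcars, times = 3)

# Base R: the number of rows in each bootstrap sample
boot_sizes <- vapply(
  boot_samp$splits,
  function(split) nrow(rsample::analysis(split)),
  integer(1)
)
boot_sizes

# purrr: the mean fuel efficiency within each bootstrap sample
boot_means <- map_dbl(boot_samp$splits, ~ mean(rsample::analysis(.x)$mpg))

# Store the results next to the splits that produced them
boot_res <- dplyr::mutate(boot_samp, mean_mpg = boot_means)
boot_res
```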
One downside to relying on common data structures is the potential loss of computational performance. In some situations, data can be encoded in specialized formats that are more efficient representations of the data. For example:
* In computational chemistry, the structure-data file format (SDF) is a tool to take chemical structures and encode them in a format that is computationally efficient to work with.
* Data that have a large number of values that are the same (such as zeros for binary data) can be stored in a sparse matrix format. This format can reduce the size of the data as well as enable more efficient computational techniques.
These formats are advantageous when the problem is well scoped and the potential data processing methods are both well defined and suited to such a format.^[Not all algorithms can take advantage of sparse representations of data. In such cases, a sparse matrix must be converted to a more conventional format before proceeding.] However, once such constraints are violated, specialized data formats are less useful. For example, if we perform a transformation of the data that converts the data into fractional numbers, the output is no longer sparse; the sparse matrix representation is helpful for one specific algorithmic step in modeling, but this is often not true before or after that specific step.
:::rmdwarning
A specialized data structure is not flexible enough for an entire modeling workflow in the way that a common data structure is.
:::
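The trade-off can be sketched with the `r pkg(Matrix)` package, which we assume is available (it ships with most R installations). The binary data here are simulated purely for illustration. A sparse encoding is much smaller than the dense one, but centering the values leaves no zeros for a sparse format to exploit.

```{r tidyverse-sparse-sketch}
library(Matrix)

set.seed(1)  # arbitrary seed
# A simulated 500 x 500 binary matrix with roughly 1% nonzero entries
dense <- matrix(as.numeric(rbinom(500 * 500, size = 1, prob = 0.01)), nrow = 500)
sparse <- Matrix(dense, sparse = TRUE)

# The sparse encoding needs far less memory than the dense one
object.size(dense)
object.size(sparse)

# Centering converts every entry to a nonzero fractional value
centered <- dense - mean(dense)
mean(dense == 0)
mean(centered == 0)
```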
One important feature in the tibble produced by `r pkg(rsample)` is that the `splits` column is a list. In this instance, each element of the list has the same type of object: an `rsplit` object that contains the information about which rows of `mtcars` belong in the bootstrap sample. _List columns_ can be very useful in data analysis and, as will be seen throughout this book, are important to tidymodels.
### Design for the pipe and functional programming
The `r pkg(magrittr)` pipe operator (`%>%`) is a tool for chaining together a sequence of R functions.^[In R 4.1.0, a native pipe operator `|>` was introduced as well. In this book, we use the `r pkg(magrittr)` pipe since users on older versions of R will not have the new native pipe.] To demonstrate, consider the following commands that sort a data frame and then retain the first 10 rows:
```{r tidyverse-no-pipe, eval = FALSE}
small_mtcars <- arrange(mtcars, gear)
small_mtcars <- slice(small_mtcars, 1:10)
# or more compactly:
small_mtcars <- slice(arrange(mtcars, gear), 1:10)
```
The pipe operator substitutes the value of the left-hand side of the operator as the first argument to the right-hand side, so we can implement the same result as before with:
```{r tidyverse-pipe, eval = FALSE}
small_mtcars <- 
  mtcars %>% 
  arrange(gear) %>% 
  slice(1:10)
```
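A quick check, offered as a sketch, confirms that the nested and piped forms produce the same object. The names `nested` and `piped` are ours.

```{r tidyverse-pipe-equiv-sketch}
library(dplyr)

# x %>% f(y) is evaluated as f(x, y)
nested <- slice(arrange(mtcars, gear), 1:10)
piped <- mtcars %>% arrange(gear) %>% slice(1:10)

identical(nested, piped)
```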
The piped version of this sequence is more readable; this readability increases as more operations are added to a sequence. This approach to programming works in this example because all of the functions we used return the same data structure (a data frame) that is then the first argument to the next function. This is by design. When possible, create functions that can be incorporated into a pipeline of operations.
If you have used `r pkg(ggplot2)`, this is not unlike the layering of plot components into a `ggplot` object with the `+` operator. To make a scatter plot with a regression line, the initial `ggplot()` call is augmented with two additional operations:
```{r tidyverse-ggplot-chain, eval = FALSE}
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() + 
  geom_smooth(method = lm)
```
While similar to the `r pkg(dplyr)` pipeline, note that the first argument to this pipeline is a data set (`mtcars`) and that each `+` operation returns a `ggplot` object. Not all pipelines need to keep the returned values (plot objects) the same as the initial value (a data frame). Using the pipe operator with `r pkg(dplyr)` operations has acclimated many R users to expect to return a data frame when pipelines are used; as shown with `r pkg(ggplot2)`, this does not need to be the case. Pipelines are incredibly useful in modeling workflows but modeling pipelines can return, instead of a data frame, objects such as model components.
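As a small sketch of a pipeline that ends in something other than a data frame, the chunk below filters `mtcars` and then fits a linear model. Since `lm()` takes a formula as its first argument, we use the `r pkg(magrittr)` placeholder `.` to send the piped data to the `data` argument instead. The filter condition and the name `fit` are arbitrary choices for illustration.

```{r tidyverse-pipe-model-sketch}
library(dplyr)

fit <-
  mtcars %>%
  filter(gear > 3) %>%
  # `.` stands in for the data frame coming through the pipe
  lm(mpg ~ wt, data = .)

class(fit)
is.data.frame(fit)
```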
R has excellent tools for creating, changing, and operating on functions, making it a great language for functional programming. This approach can replace iterative loops in many situations, such as when a function returns a value without other side effects.^[Examples of function side effects could include changing global data or printing a value.]
Let's look at an example. Suppose you are interested in the logarithm of the ratio of the fuel efficiency to the car weight. To those new to R and/or coming from other programming languages, a loop might seem like a good option:
```{r tidyverse-loop}
n <- nrow(mtcars)
ratios <- rep(NA_real_, n)
for (car in 1:n) {
  ratios[car] <- log(mtcars$mpg[car]/mtcars$wt[car])
}
head(ratios)
```
Those with more experience in R may know that there is a much simpler and faster vectorized version that can be computed by:
```{r tidyverse-vectorized}
ratios <- log(mtcars$mpg/mtcars$wt)
```
However, in many real-world cases, the element-wise operation of interest is too complex for a vectorized solution. In such a case, a good approach is to write a function to do the computations. When we design for functional programming, it is important that the output depends only on the inputs and that the function has no side effects. Violations of these ideas in the following function are shown with comments:
```{r tidyverse-non-functional}
compute_log_ratio <- function(mpg, wt) {
  log_base <- getOption("log_base", default = exp(1)) # gets external data
  results <- log(mpg/wt, base = log_base)
  print(mean(results))                                # prints to the console
  done <<- TRUE                                       # sets external data
  results
}
```
A better version would be:
```{r tidyverse-better-function}
compute_log_ratio <- function(mpg, wt, log_base = exp(1)) {
  log(mpg/wt, base = log_base)
}
```
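To illustrate what the better version buys us, here is a short usage sketch. The result depends only on the arguments supplied, and a different base is requested explicitly through `log_base`, not through a global option. The names `natural` and `base_10` are ours.

```{r tidyverse-pure-function-sketch}
# Repeated from above so that this chunk runs on its own
compute_log_ratio <- function(mpg, wt, log_base = exp(1)) {
  log(mpg/wt, base = log_base)
}

# The default reproduces the natural log of the ratio
natural <- compute_log_ratio(mtcars$mpg, mtcars$wt)
all.equal(natural, log(mtcars$mpg/mtcars$wt))

# A different base is passed as an argument
base_10 <- compute_log_ratio(mtcars$mpg, mtcars$wt, log_base = 10)
all.equal(base_10, log10(mtcars$mpg/mtcars$wt))
```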
The `r pkg(purrr)` package contains tools for functional programming. Let's focus on the `map()` family of functions, which operates on vectors and always returns the same type of output. The most basic function, `map()`, always returns a list and uses the basic syntax of `map(vector, function)`. For example, to take the square root of our data, we could use:
```{r map-basic}
map(head(mtcars$mpg, 3), sqrt)
```
There are specialized variants of `map()` that return an atomic vector instead of a list when we know or expect that the function will generate one of the basic vector types. For example, since the square root returns a double-precision number:
```{r map-dbl}
map_dbl(head(mtcars$mpg, 3), sqrt)
```
There are also mapping functions that operate across multiple vectors:
```{r map2}
log_ratios <- map2_dbl(mtcars$mpg, mtcars$wt, compute_log_ratio)
head(log_ratios)
```
The `map()` functions also allow for temporary, anonymous functions defined using the tilde character. The argument values are `.x` and `.y` for `map2()`:
```{r map2-inline}
map2_dbl(mtcars$mpg, mtcars$wt, ~ log(.x/.y)) %>% 
  head()
```
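The tilde shorthand is a convenience. An ordinary anonymous function with named arguments gives the same result, as the sketch below shows. On R 4.1.0 or later, the same function could also be written with the `\(mpg, wt)` shorthand. The names `tilde_version` and `function_version` are ours.

```{r map2-function-sketch}
library(purrr)

tilde_version <- map2_dbl(mtcars$mpg, mtcars$wt, ~ log(.x/.y))

# An ordinary anonymous function lets us name the arguments
function_version <- map2_dbl(
  mtcars$mpg,
  mtcars$wt,
  function(mpg, wt) log(mpg/wt)
)

identical(tilde_version, function_version)
```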
These examples have been trivial, but in later sections the same tools will be applied to more complex problems.
:::rmdnote
For functional programming in tidy modeling, functions should be defined so that functions like `map()` can be used for iterative computations.
:::
## Examples of Tidyverse Syntax
Let's begin our discussion of tidyverse syntax by exploring more deeply what a tibble is, and how tibbles work. Tibbles have slightly different rules than basic data frames in R. For example, tibbles naturally work with column names that are not syntactically valid variable names:
```{r tidyverse-names}
# Wants valid names:
data.frame(`variable 1` = 1:2, two = 3:4)
# But can be coerced to use them with an extra option:
df <- data.frame(`variable 1` = 1:2, two = 3:4, check.names = FALSE)
df
# But tibbles just work:
tbbl <- tibble(`variable 1` = 1:2, two = 3:4)
tbbl
```
Standard data frames enable _partial matching_ of column names with `$`, so that code using only a portion of a column name still works. Tibbles prevent this from happening since it can lead to accidental errors.
```{r tidyverse-partial, error = TRUE}
df$tw
tbbl$tw
```
Tibbles also prevent one of the most common R errors: dropping dimensions. If a standard data frame is subset down to a single column, the result is converted to a vector. Tibbles never do this:
```{r tidyverse-drop}
df[, "two"]
tbbl[, "two"]
```
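When a vector really is what we want, it has to be requested explicitly. The sketch below rebuilds the two objects so that it runs on its own, and shows a few ways to do this. It also shows the `drop = FALSE` argument, which makes a standard data frame keep its dimensions.

```{r tidyverse-pull-sketch}
library(tibble)

df <- data.frame(`variable 1` = 1:2, two = 3:4, check.names = FALSE)
tbbl <- tibble(`variable 1` = 1:2, two = 3:4)

# Single-bracket subsetting of a tibble keeps the tibble structure
is_tibble(tbbl[, "two"])

# These all return the column as a plain vector
tbbl$two
tbbl[["two"]]
dplyr::pull(tbbl, two)

# For a standard data frame, drop = FALSE keeps the data frame
is.data.frame(df[, "two", drop = FALSE])
```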
There are other advantages to using tibbles instead of data frames, such as better printing and more.^[Chapter 10 of @wickham2016 has more details on tibbles.]
```{r tidyverse-import-raw, include = FALSE}
url <- "chi.csv"
train_cols <- 
  cols(
    station_id = col_double(),
    stationname = col_character(),
    date = col_character(),
    daytype = col_character(),
    rides = col_double()
  )
num_combos <- 
  read_delim(url, delim = ",", col_types = train_cols) %>% 
  distinct(date, stationname) %>% 
  nrow()
```
To demonstrate some syntax, let's use tidyverse functions to read in data that could be used in modeling. The data set comes from the city of Chicago's data portal and contains daily ridership data for the city's elevated train stations. The data set has columns for:
- the station identifier (numeric)
- the station name (character)
- the date (character in `mm/dd/yyyy` format)
- the day of the week (character)
- the number of riders (numeric)
Our tidyverse pipeline will conduct the following tasks, in order:
1. Use the tidyverse package `r pkg(readr)` to read the data from the source website and convert them into a tibble. To do this, the `read_csv()` function can determine the type of data by reading an initial number of rows. Alternatively, if the column names and types are already known, a column specification can be created in R and passed to `read_csv()`.
1. Select the columns that are needed, dropping the rest (such as the station ID), and change the column `stationname` to `station`. The function `select()` is used for this. When selecting, use either the column names or a `r pkg(dplyr)` selector function. A new variable name can be declared using the argument format `new_name = old_name`.
1. Convert the date field to the R date format using the `mdy()` function from the `r pkg(lubridate)` package. We also convert the ridership numbers to thousands. Both of these computations are executed using the `dplyr::mutate()` function.
1. Use the maximum number of rides for each station and day combination. This mitigates the issue of a small number of days that have more than one record of ridership numbers at certain stations. We group the ridership data by station and day, and then summarize within each of the `r num_combos` unique combinations with the maximum statistic.
The tidyverse code for these steps is:
```{r tidyverse-import, message = FALSE}
library(tidyverse)
library(lubridate)
url <- "https://data.cityofchicago.org/api/views/5neh-572f/rows.csv?accessType=DOWNLOAD&bom=true&format=true"
all_stations <- 
  # Step 1: Read in the data.
  read_csv(url) %>% 
  # Step 2: filter columns and rename stationname
  dplyr::select(station = stationname, date, rides) %>% 
  # Step 3: Convert the character date field to a date encoding.
  # Also, put the data in units of 1K rides
  mutate(date = mdy(date), rides = rides / 1000) %>% 
  # Step 4: Summarize the multiple records using the maximum.
  group_by(date, station) %>% 
  summarize(rides = max(rides), .groups = "drop")
```
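Because the pipeline above depends on a download, it can help to trace the same steps on a few rows that can be checked by eye. The sketch below uses a made-up, three-row stand-in for the ridership file (the values are not real data) and passes an explicit column specification to `read_csv()` in place of type guessing. It assumes `r pkg(readr)` version 2.0 or later, where `I()` marks a string as literal data. The names `toy_csv`, `toy_cols`, and `toy_stations` are ours.

```{r tidyverse-import-toy, message = FALSE}
library(tidyverse)
library(lubridate)

# A made-up miniature of the ridership file; the first station has two
# records for the same day
toy_csv <- paste(
  "station_id,stationname,date,daytype,rides",
  "40010,Austin-Forest Park,01/01/2001,U,290",
  "40010,Austin-Forest Park,01/01/2001,U,310",
  "40020,Harlem-Lake,01/01/2001,U,633",
  sep = "\n"
)

# The column names and types are known, so declare them up front
toy_cols <-
  cols(
    station_id = col_double(),
    stationname = col_character(),
    date = col_character(),
    daytype = col_character(),
    rides = col_double()
  )

toy_stations <-
  read_csv(I(toy_csv), col_types = toy_cols) %>%
  dplyr::select(station = stationname, date, rides) %>%
  mutate(date = mdy(date), rides = rides / 1000) %>%
  group_by(date, station) %>%
  summarize(rides = max(rides), .groups = "drop")

toy_stations
```

The duplicated station and day collapse to a single row that holds the larger of the two ridership values.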
This pipeline of operations illustrates why the tidyverse is popular. Each data manipulation in the series is carried out by a simple, easy-to-understand function, and the series is bundled in a streamlined, readable way. The focus is on how the user interacts with the software. This approach enables more people to learn R and achieve their analysis goals, and adopting these same principles for modeling in R has the same benefits.
## Chapter Summary
This chapter introduced the tidyverse, with a focus on applications for modeling and how tidyverse design principles inform the tidymodels framework. Think of the tidymodels framework as applying tidyverse principles to the domain of building models. We described differences in conventions between the tidyverse and base R, and introduced two important components of the tidyverse system, tibbles and the pipe operator `%>%`. Data cleaning and processing can feel mundane at times, but these tasks are important for modeling in the real world; we illustrated how to use tibbles, the pipe, and tidyverse functions in an example data import and processing exercise.