Computing on the language
''Flexibility in syntax, if it does not lead to ambiguity, would seem a reasonable thing to ask of an interactive programming language.'' --- Kent Pitman, http://www.nhplace.com/kent/Papers/Special-Forms.html
R has powerful tools for computing not only on values, but also on the actions that lead to those values. These tools are powerful and magical, and they are one of R's most surprising features if you're coming from another programming language. Take the following simple snippet of code that draws a sine curve:
x <- seq(0, 2 * pi, length = 100)
sinx <- sin(x)
plot(x, sinx, type = "l")
Look at the labels on the axes! How did R know that the variable on the x axis was called x and the variable on the y axis was called sinx? In most programming languages you can only access the values of the arguments provided to a function, but in R you can also access the expressions used to compute them. Combined with R's lazy evaluation mechanism, this gives function authors considerable power to access the underlying expression and do special things with it.
Techniques based on these tools are generally called "computing on the language", and in R they are as powerful as functional and object-oriented programming. This chapter will introduce you to the basic ideas of special evaluation, show you how they are used in base R, and show how you can use them to create your own functions that save typing during interactive analysis. These tools are very useful for developing convenient user-facing functions because they can dramatically reduce the amount of typing required to specify an action.
Computing on the language is an extremely powerful tool, but it can also create code that is hard for others to understand, and it is substantially harder to program with. Before you use it, make sure that you have exhausted all other possibilities. You'll also learn about the downsides: because these tools work with expressions rather than values, they increase ambiguity in function calls and make functions harder to call from other functions.
The following chapters, Expressions and Special environments, expand on these ideas, discussing the underlying data structures and how you can understand and manipulate them to create new tools.
In this chapter you'll learn:
- How many functions, such as plot() and data.frame(), capture the names of the variables supplied to them, and the downsides of this technique.
- How to manipulate a data frame by referring to its variables: subset(), transform(), plyr::mutate(), plyr::arrange(), plyr::summarise(), with().
- How to work around non-standard evaluation (as used by lattice functions) with substitute().
- How to capture an expression for later evaluation: ggplot2 and plyr.
- How to use formulas to describe computations: lm() and the lattice package.
The tool that makes non-standard evaluation possible in R is substitute(). It looks at a function argument, and instead of seeing the value, it sees the expression used to compute that value:
f <- function(x) {
substitute(x)
}
f(1:10)
f(x)
f(x + y ^ 2 / z + exp(a * sin(b)))
We won't worry yet about exactly what sort of object substitute() returns (that's the topic of the Expressions chapter); for now we'll just call it an expression. (Note that it's not the same thing as what's returned by the expression() function: we'll call that an expression object.)
substitute() works because function arguments in R are only evaluated when they are needed, not automatically when the function is called. This means a function argument is not just a simple value; instead it stores both the expression to compute the value and the environment in which to compute it. Together these two things are called a promise. Most of the time in R you don't need to know anything about promises, because the first time you access a promise it is seamlessly evaluated, returning its value.
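Here's a tiny illustration of this laziness (my own example, not from the text above): because h() never touches its argument, the promise is never forced and the stop() inside it never runs.
h <- function(a) {
  # a is a promise; we never use it, so it is never evaluated
  10
}
h(stop("this is never evaluated"))
# [1] 10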
We need one more function if we want to understand how plot() and data.frame() work: deparse(). This function takes an expression and converts it to a character vector.
g <- function(x) deparse(substitute(x))
g(1:10)
g(x)
g(x + y ^ 2 / z + exp(a * sin(b)))
There's one important caveat with deparse(): it can return multiple strings if the input is too long:
g(a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z)
If you need a single string, you can work around this by using the width.cutoff argument (which has a maximum value of 500), or by joining the lines back together with paste().
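For example, one possible workaround (a sketch; deparse_all() is a helper of my own, not a base function) looks like this:
deparse_all <- function(x) {
  # deparse at the maximum width, then glue any remaining pieces back together
  paste(deparse(x, width.cutoff = 500L), collapse = " ")
}
g2 <- function(x) deparse_all(substitute(x))
g2(a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z)
# now a single string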
You might wonder why we couldn't use our original f() to compute g(). Let's try it:
g <- function(x) deparse(f(x))
g(1:10)
g(x)
g(x + y ^ 2 / z + exp(a * sin(b)))
This is one of the downsides of functions that use substitute(): because they use the expression, not the value, of an argument, it becomes harder to call them from other functions. We'll talk more about this problem, and some remedies, later on.
There are lots of functions in base R that use these ideas. Some use them to avoid quotes:
library(ggplot2)
library("ggplot2")
Others use them to provide default labels. For example, plot.default() contains code that basically does the following (the real code is more complicated because of the way base plotting methods work, but it's effectively the same):
plot.default <- function(x, y = NULL, xlabel = NULL, ylabel = NULL, ...) {
  ...
  xlab <- if (is.null(xlabel) && !missing(x)) deparse(substitute(x))
  ylab <- if (is.null(ylabel) && !missing(y)) deparse(substitute(y))
  ...
}
If a label is not set and the variable is present, then the expression used to generate the value of x is used as the default label on the x axis.
data.frame() does a similar thing. It automatically labels variables with the expressions used to compute them:
x <- 1:4
y <- letters[1:4]
names(data.frame(x, y))
This wouldn't be possible in most programming languages because functions usually only see values (e.g. 1:4 and c("a", "b", "c", "d")), not the expressions that created them (x and y).
Just printing out the expression used to generate an argument's value is useful, but we can do even more with the unevaluated expression. For example, take subset(). It's a useful interactive shortcut for subsetting data frames: instead of repeating the name of the data frame you're working with again and again, you can save some typing:
subset(mtcars, cyl == 4)
# equivalent to:
# mtcars[mtcars$cyl == 4, ]
subset(mtcars, vs == am)
# equivalent to:
# mtcars[mtcars$vs == mtcars$am, ]
subset() is special because vs == am and cyl == 4 aren't evaluated in the global environment: instead they're evaluated inside the data frame. In other words, subset() implements different scoping rules: instead of looking for those variables in the current environment, subset() looks in the specified data frame. This is called non-standard evaluation: you are deliberately breaking R's usual rules in order to do something special.
How does subset() work? We've already seen how to capture the expression that represents an argument rather than its value, so we just need to figure out how to evaluate that expression in the right context, i.e. so that cyl is interpreted as mtcars$cyl. To do this we need eval(), which takes an expression and evaluates it in the specified environment.
But before we can do that, we need to learn one more useful function: quote(). It's similar to substitute(), but it always gives you back exactly the expression you entered. This makes it useful for interactive experimentation.
quote(1:10)
quote(x)
quote(x + y ^ 2 / z + exp(a * sin(b)))
Now let's experiment with eval(). If you only provide one argument, it evaluates the expression in the current environment. This makes eval(quote(x)) exactly equivalent to typing x, regardless of what x is:
eval(quote(x <- 1))
eval(quote(x))
eval(quote(cyl))
# Error: object 'cyl' not found
Note that quote() and eval() are basically opposites. In the example below, each eval() peels off one layer of quoting.
quote(2 + 2)
eval(quote(2 + 2))
quote(quote(2 + 2))
eval(quote(quote(2 + 2)))
eval(eval(quote(quote(2 + 2))))
What will this code return?
eval(quote(eval(quote(eval(quote(2 + 2))))))
The second argument to eval() controls the environment in which the code is evaluated:
x <- 10
eval(quote(x))
e <- new.env()
e$x <- 20
eval(quote(x), e)
Instead of an environment, the second argument can also be a list or a data frame. This works because an environment is basically a set of mappings between names and values, in the same way as a list or data frame.
eval(quote(x), list(x = 30))
eval(quote(x), data.frame(x = 40))
This is basically what we want for subset():
eval(quote(cyl == 4), mtcars)
eval(quote(vs == am), mtcars)
We can combine eval() and substitute() to write subset(): we capture the call representing the condition, evaluate it in the context of the data frame, and then use the result for subsetting:
subset2 <- function(x, condition) {
condition_call <- substitute(condition)
r <- eval(condition_call, x)
x[r, ]
}
subset2(mtcars, cyl == 6)
When you first start using eval() it's easy to make mistakes. Here's a common one: forgetting to quote the input:
eval(cyl, mtcars)
# Error: object 'cyl' not found
# (cyl is looked up in the calling environment, not in mtcars)
eval(quote(cyl), mtcars)
# Works: the quoted symbol is evaluated inside mtcars
- The real subset function (subset.data.frame()) does two other things to the result. What are they?
- The other component of the real subset function is variable selection. It allows you to work with variable names as if they were positions, so you can do things like subset(mtcars, , -cyl) to drop the cylinder variable, or subset(mtcars, , disp:drat) to select all the variables between disp and drat. How does select work? I've made it easier to understand by extracting it out into its own function:
select <- function(df, vars) {
  vars <- substitute(vars)
  var_pos <- setNames(as.list(seq_along(df)), names(df))
  pos <- eval(vars, var_pos)
  df[, pos, drop = FALSE]
}
select(mtcars, -cyl)
- What does evalq() do? Use it to reduce the amount of typing for the examples above that use both eval() and quote().
While it certainly looks like our subset2() function works, whenever we're working with expressions instead of values we need to test a little more carefully. For example, you might expect that the following uses of subset2() should all return the same result, because each variable refers to the same value:
y <- 4
x <- 4
condition <- 4
condition_call <- 4
subset2(mtcars, cyl == 4)
subset2(mtcars, cyl == y)
subset2(mtcars, cyl == x)
subset2(mtcars, cyl == condition)
subset2(mtcars, cyl == condition_call)
What's going wrong? You can get a hint from the variable names I've chosen: they are all names of variables defined inside subset2(). It seems that if eval() can't find a variable inside the data frame (its second argument), it looks in the function's environment instead. That's obviously not what we want, so we need some way to tell eval() to look somewhere else if it can't find the variables in the data frame.
The key is the third argument: enclos. This allows us to specify a parent (or enclosing) environment for objects that don't have one, like lists and data frames (enclos is ignored if we pass in a real environment). The enclosing environment is where any objects that aren't found in the data frame will be looked for. By default it uses the environment of the current function, which is not what we want.
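As a quick illustration of enclos (a toy example of my own): z isn't in the list, so eval() falls back to the environment we supply.
enc <- new.env()
enc$z <- 100
eval(quote(x + z), list(x = 1), enclos = enc)
# [1] 101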
We want to look for x in the environment in which subset2() was called. In R terminology this is called the parent frame, and it is accessed with parent.frame(). This is an example of dynamic scoping. With this modification our function works:
subset2 <- function(x, condition) {
condition_call <- substitute(condition)
r <- eval(condition_call, x, parent.frame())
x[r, ]
}
x <- 4
subset2(mtcars, cyl == x)
Using enclos is just a shortcut for converting a list or data frame to an environment with the desired parent yourself. We can use list2env() to turn a list into an environment and explicitly set the parent ourselves:
subset2 <- function(x, condition) {
condition_call <- substitute(condition)
env <- list2env(x, parent = parent.frame())
r <- eval(condition_call, env)
x[r, ]
}
x <- 4
subset2(mtcars, cyl == x)
When evaluating code in a non-standard way, it's also a good idea to test that your code works when run outside of the global environment:
f <- function() {
x <- 6
subset2(mtcars, cyl == x)
}
f()
And indeed it now works.
- plyr::arrange() works similarly to subset(), but instead of selecting rows, it reorders them. How does it work? What does substitute(order(...)) do?
- What does transform() do? (Hint: read the documentation.) How does it work? (Hint: read the source code for transform.data.frame().) What does substitute(list(...)) do? (Hint: create a function that does only that and experiment with it.)
- plyr::mutate() is similar to transform(), but it applies the transformations sequentially, so that a transformation can refer to columns that were just created:
df <- data.frame(x = 1:5)
transform(df, x2 = x * x, x3 = x2 * x)
plyr::mutate(df, x2 = x * x, x3 = x2 * x)
How does mutate() work? What's the key difference between mutate() and transform()?
- What does with() do? How does it work? (Read the source code for with.default().)
- What does within() do? How does it work? (Read the source code for within.data.frame().) What makes the code so much more complicated than with()?
Typically, computing on the language is most useful for functions that are called directly by the user, not by other functions. While subset() saves typing, it has one big disadvantage: it's now difficult to use non-interactively, e.g. from another function. For example, you might try to use subset() from within a function that is given the name of a variable and its desired value:
colname <- "cyl"
val <- 6
subset(mtcars, colname == val)
# Zero rows because "cyl" != 6
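If all you have is a column name stored as a string, one simple workaround (a sketch of an alternative, not how subset() itself works) is to fall back to standard evaluation and ordinary subsetting:
colname <- "cyl"
val <- 6
# [[ uses the *value* of colname, so this does what we intended above
mtcars[mtcars[[colname]] == val, ]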
Or imagine we want to create a function that randomly reorders a subset of the data. A nice way to write that function would be to combine a function for random reordering with a function for subsetting (which we already have!). Let's try that:
scramble <- function(x) x[sample(nrow(x)), ]
subscramble <- function(x, condition) {
scramble(subset(x, condition))
}
But when we run that we get:
subscramble(mtcars, cyl == 4)
# Error in eval(expr, envir, enclos) : object 'cyl' not found
traceback()
# 5: eval(expr, envir, enclos)
# 4: eval(condition_call, x)
# 3: subset(x, condition)
# 2: scramble(subset(x, condition))
# 1: subscramble(mtcars, cyl == 4)
What's gone wrong? To figure it out, let's debugonce() subset and work through the code line by line:
> debugonce(subset)
> subscramble(mtcars, cyl == 4)
debugging in: subset(x, condition)
debug: {
condition_call <- substitute(condition)
r <- eval(condition_call, x)
x[r, ]
}
Browse[2]> n
debug: condition_call <- substitute(condition)
Browse[2]> n
debug: r <- eval(condition_call, x)
Browse[2]> condition_call
condition
Browse[2]> eval(condition_call, x)
Error in eval(expr, envir, enclos) : object 'cyl' not found
Browse[2]> condition
Error: object 'cyl' not found
In addition: Warning messages:
1: restarting interrupted promise evaluation
2: restarting interrupted promise evaluation
Can you see what the problem is? condition_call contains the expression condition, so when we try to evaluate condition_call we force the promise condition, whose expression is cyl == 4. That expression is evaluated in the environment where subscramble() was called (here, the global environment), which doesn't contain an object called cyl, so we get an error. If cyl is set in the global environment, far more confusing things can happen:
cyl <- 4
subscramble(mtcars, cyl == 4)
cyl <- sample(10, 100, rep = T)
subscramble(mtcars, cyl == 4)
This is an example of the general tension between functions that are designed for interactive use and functions that are safe to program with. A function that uses substitute() might save typing, but it's difficult to call from another function. As a developer, you should also provide an alternative version that works when passed a quoted expression. For example, we could rewrite:
subset2_q <- function(x, condition) {
r <- eval(condition, x, parent.frame())
x[r, ]
}
subset2 <- function(x, condition) {
subset2_q(x, substitute(condition))
}
subscramble <- function(x, condition) {
condition <- substitute(condition)
scramble(subset2_q(x, condition))
}
I usually suffix these functions with q to indicate that they take a quoted call. Most users won't need them, so the name can be a little longer.
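As a quick sanity check (run interactively), the interactive version behaves as before, the _q version accepts an already-quoted call, and subscramble() no longer errors:
x <- 4
subset2(mtcars, cyl == x)            # interactive use: unchanged
subset2_q(mtcars, quote(cyl == 4))   # programmatic use: pass a quoted call
subscramble(mtcars, cyl == 4)        # now works from inside another function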
You might wonder why the function couldn't do this automatically:
subset <- function(x, condition) {
if (!is.call(condition)) {
condition <- substitute(condition)
}
r <- eval(condition, x)
x[r, ]
}
subset(mtcars, quote(cyl == 4))
subset(mtcars, cyl == 4)
But hopefully a little thought, or maybe some experimentation, will show why this doesn't work.
Following the examples above, whenever you write your own functions that use non-standard evaluation, you should always provide alternatives that others can use. But what happens if you want to call a function that uses non-standard evaluation and doesn't provide a form that takes quoted expressions? For example, imagine you want to create a lattice graphic given the names of two variables:
library(lattice)
xyplot(mpg ~ disp, data = mtcars)
x <- quote(mpg)
y <- quote(disp)
xyplot(x ~ y, data = mtcars)
Again, we can turn to substitute() and use it for another purpose: modifying expressions. So far we've only used substitute() to capture the unevaluated expression associated with arguments, but it can actually do much more, and it is very useful for manipulating expressions in general.
Unfortunately substitute() has a "feature" that makes experimenting with it interactively a bit of a pain: it never does substitutions when run from the global environment, and just behaves like quote():
a <- 1
b <- 2
substitute(a + b + x)
But if we run it inside a function, substitute() substitutes what it can and leaves everything else the same:
f <- function() {
a <- 1
b <- 2
substitute(a + b + x)
}
f()
To make it easier to experiment with substitute(), pryr provides the subs() function. It works exactly the same way as substitute(), except that it has a shorter name and it also performs substitution in the global environment (by converting it to a list behind the scenes). Together, this makes it much easier to experiment with substitution:
subs(a + b + x)
The second argument (to both subs() and substitute()) can override the use of the current environment and provide an alternative list of name-value pairs to use instead. The following example uses that technique to show some variations on substituting a string, a variable name, or a function call:
subs(a + b, list(a = "y"))
subs(a + b, list(a = quote(y)))
subs(a + b, list(a = quote(y())))
Remember that every action in R is a function call, so we can also replace + with another function:
subs(a + b, list("+" = quote(f)))
subs(a + b, list("+" = quote(`*`)))
Note that it's quite possible to make nonsense commands with substitute():
subs(y <- y + 1, list(y = 1))
You can also use substitute to insert any arbitrary object into an expression. This is technically OK, but it often results in surprising and undesirable behaviour. In the example below, the expression we create doesn't print correctly, but it returns the correct result when we evaluate it:
df <- data.frame(x = 1)
(x <- subs(class(df)))
eval(x)
Formally, substitution takes place by examining each name in the expression. If the name refers to:
- an ordinary variable, it's replaced by the value of the variable.
- a promise, it's replaced by the expression associated with the promise.
- ..., it's replaced by the contents of ... (this only applies when the substitution occurs inside a function).
Otherwise the name is left as is.
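Here's a small demonstration of the first two rules (a toy example of my own; it uses subs() for the first case so that substitution happens at the top level):
aa <- 1
subs(aa + zz)
# 1 + zz      -- aa is an ordinary variable, so it's replaced by its value;
#                zz isn't bound anywhere, so it's left as is
demo_sub <- function(b) substitute(aa + b)
demo_sub(y * 2)
# aa + y * 2  -- b is a promise, so it's replaced by its expression;
#                aa isn't bound inside demo_sub(), so it's left alone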
We can use this to create the right call to xyplot():
x <- quote(mpg)
y <- quote(disp)
subs(xyplot(x ~ y, data = mtcars))
It's even simpler inside a function, because we don't need to explicitly quote the x and y variables. Following the rules above, substitute() replaces named arguments with their expressions, not their values:
xyplot2 <- function(x, y, data = data) {
  substitute(xyplot(x ~ y, data = data))
}
xyplot2(mpg, disp, data = mtcars)
If we include ... in the call to substitute(), we can also add additional arguments to the call:
xyplot3 <- function(x, y, ...) {
  substitute(xyplot(x ~ y, ...))
}
xyplot3(mpg, disp, data = mtcars, col = "red", aspect = "xy")
One application of this idea is to make a version of substitute() that evaluates its first argument, i.e. a version that uses standard evaluation. Note that plain substitute() doesn't do what we want here; it leaves x untouched:
x <- quote(a + b)
substitute(x, list(a = 1, b = 2))
Instead we can use pryr::substitute2():
x <- quote(a + b)
substitute2(x, list(a = 1, b = 2))
The implementation of substitute2() is short, but deep:
substitute2 <- function(x, env) {
call <- substitute(substitute(y, env), list(y = x))
eval(call)
}
Let's work through the example above: substitute2(x, list(a = 1, b = 2)). It's a little tricky because of substitute()'s non-standard evaluation rules: we can't use the usual technique of working through the parentheses from the inside out.
- First, substitute(substitute(y, env), list(y = x)) is evaluated. The first argument is captured unevaluated, and its symbols are substituted using the second argument: a list containing a single item, the value of x under the name y. Because we've put x inside a list, it will be evaluated, and the rules of substitute will replace y with its value. This yields the expression substitute(a + b, env).
- Next, we evaluate that expression inside the current function. substitute() specially evaluates its first argument, a + b, and looks for name-value pairs in env, which evaluates to list(a = 1, b = 2). Those are both values (not promises), so the result is 1 + 2.
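To see the intermediate expression concretely, here's a sketch that writes the substitution list out literally instead of passing it through env:
x <- quote(a + b)
call <- substitute(substitute(y, list(a = 1, b = 2)), list(y = x))
call
# substitute(a + b, list(a = 1, b = 2))
eval(call)
# 1 + 2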
Another frequently useful technique is to capture all of the unevaluated expressions in .... Base R functions do this in many ways, but there's one technique that works well in a wide variety of situations:
dots <- function(...) {
eval(substitute(alist(...)))
}
This uses the alist() function, which simply captures all its arguments unevaluated. This function is the same as pryr::dots(), and pryr also provides pryr::named_dots(), which ensures all arguments are named, using the deparsed expressions as default names.
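For example (a quick interactive check of my own), dots() hands back the unevaluated expressions, keeping any names you supplied:
captured <- dots(x, y * 2, z = a + b)
length(captured)   # 3 unevaluated expressions
names(captured)    # ""  ""  "z" -- only the last argument was named
captured[[2]]      # the expression y * 2, not its value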
There are usually two principles you can follow when modelling the evaluation of R code:
- If the underlying values are the same, the results will be the same, i.e. the following three calls will all return the same result:
x <- 10; y <- 10
f(10); f(x); f(y)
- You can model evaluation by working from the innermost parentheses to the outermost.
Non-standard evaluation can break both principles. This makes the mental model needed to correctly predict the output much more complicated, so non-standard evaluation is only worthwhile when there is significant gain.
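To see the first principle break, consider a small function of my own that uses deparse(substitute()): the same underlying value gives three different results.
lbl <- function(a) deparse(substitute(a))
x <- 10; y <- 10
lbl(10)  # "10"
lbl(x)   # "x"
lbl(y)   # "y" -- same value, three different answers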
For example, library() and require() allow you to call them either with or without quotes, because internally they use deparse(substitute(x)) plus a couple of tricks. That means these two lines do exactly the same thing:
library(ggplot2)
library("ggplot2")
However, things start to get complicated if the variable has a value. What do you think the following lines of code will do?
ggplot2 <- "plyr"
library(ggplot2)
It loads ggplot2, not plyr. If you want to load plyr (the value of the ggplot2 variable), you need to use an additional argument:
library(ggplot2, character.only = TRUE)
Using one argument to change the meaning of another argument is not a great idea, because it means you must completely and carefully read all of the function's arguments to understand what any one argument means. You can't understand the effect of each argument in isolation, and hence it's harder to reason about the result of a function call.
There are a number of other R functions that use substitute() and deparse() in this way: ls(), rm(), data(), demo(), example(), vignette(). These all use non-standard evaluation, and each then provides its own special way of falling back to the usual rules. To me, eliminating two quotes is not worth the cognitive cost of non-standard evaluation, and I don't recommend you use substitute() for this purpose.
One situation where non-standard evaluation is more useful is data.frame(), which uses the input expressions to automatically name the output variables if names are not otherwise provided:
x <- 10
y <- "a"
df <- data.frame(x, y)
names(df)
I think this is worthwhile in data.frame() because it eliminates a lot of redundancy in the common scenario where you're creating a data frame from existing variables, and, importantly, it's easy to override this behaviour by supplying names for each variable.
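For example, explicit names always win over the captured expressions:
names(data.frame(x, y))                  # "x" "y" -- taken from the input expressions
names(data.frame(first = x, second = y)) # "first" "second" -- explicit names win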
The code for data.frame() is rather complicated, but we can create our own simple version for lists to see how a function that does this might work. The key is pryr::named_dots(), a function which returns the unevaluated ... arguments with default names. Then it's just a matter of arranging the evaluated results in a list:
list2 <- function(...) {
dots <- named_dots(...)
lapply(dots, eval, parent.frame())
}
x <- 1; y <- 2
list2(x, y)
list2(x, z = y)
To show how I've used some of these ideas in practice, the following two sections show applications of non-standard evaluation to plyr and ggplot2.
Both plyr and ggplot2 have ways of capturing what you want to do and then performing that action later. ggplot2 uses aes() to define a set of mappings between variables in your data and visual properties of your graphic. plyr uses the .() function to capture the names (or more complicated expressions) of the variables used to split a data frame into pieces. Let's look at the code:
. <- function (..., .env = parent.frame()) {
structure(
as.list(match.call()[-1]),
env = .env,
class = "quoted"
)
}
aes <- function (x = NULL, y = NULL, ...) {
aes <- structure(
as.list(match.call()[-1]),
class = "uneval")
class(aes) <- "uneval"
ggplot2:::rename_aes(aes)
}
Both functions were written when I didn't know so much about non-standard evaluation; if I were to write them today, I'd use the dots() helper function I showed previously. I'd also think more about the environment in which the results of aes() should be evaluated, and how that integrates with ggplot2's rules for aesthetic-mapping inheritance. That's a bit murky at the moment and leads to confusion when creating complex graphics across multiple functions.
ggplot2 and plyr also provide slightly different ways to use standard evaluation so that you can refer to variables indirectly. ggplot2 provides aes_string(), which allows you to specify variables by the string representation of their names, and plyr uses S3 methods so that you can supply either an object of class quoted (as created with .()) or a regular character vector.
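For example (assuming ggplot2 is installed), these two plot specifications are equivalent, but only the second can be driven by column names stored as strings:
library(ggplot2)
xvar <- "wt"
yvar <- "mpg"
# non-standard evaluation: the variable names are captured from the code you type
ggplot(mtcars, aes(wt, mpg)) + geom_point()
# standard evaluation: the names are supplied as strings, so they can come from variables
ggplot(mtcars, aes_string(xvar, yvar)) + geom_point()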
The plyr package also uses non-standard evaluation to complete the set of tools provided by the base subset() and transform() functions with mutate(), summarise() and arrange(). Each of these functions has the same interface: the first argument is a data frame, the subsequent arguments are evaluated in the context of that data frame (i.e. they look there first for variables, and then in the calling environment), and they return a data frame.
The following code shows the essence of how these four functions work:
subset2 <- function(.data, subset) {
  sub <- eval(substitute(subset), .data, parent.frame())
  sub <- sub & !is.na(sub)
  .data[sub, , drop = FALSE]
}
arrange2 <- function(.data, ...) {
  ord <- eval(substitute(order(...)), .data, parent.frame())
  .data[ord, , drop = FALSE]
}
mutate2 <- function(.data, ...) {
  cols <- named_dots(...)
  data_env <- eval_df(.data, parent.frame(), cols)
  out_cols <- union(names(.data), names(cols))
  quickdf(mget(out_cols, data_env))
}
summarise2 <- function(.data, ...) {
  cols <- named_dots(...)
  data_env <- eval_df(.data, parent.frame(), cols)
  quickdf(mget(names(cols), data_env))
}
eval_df <- function(data, env, exprs) {
  data_env <- list2env(data, parent = env)
  for (nm in names(exprs)) {
    # evaluate each expression in turn, so later ones can see earlier results
    data_env[[nm]] <- eval(exprs[[nm]], data_env)
  }
  data_env
}
You might be surprised to see the for loop in eval_df(), but it's necessary because the computation of one variable might depend on the results of previous variables (this is the key difference between mutate() and transform()).
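To see the sequential idea in isolation, here's a minimal self-contained sketch (my own; it avoids the plyr and pryr internals above and just uses alist() and list2env()). Each expression is evaluated in an environment that already contains the columns created before it:
mutate_seq <- function(.data, ...) {
  exprs <- eval(substitute(alist(...)))
  env <- list2env(as.list(.data), envir = new.env(parent = parent.frame()))
  for (nm in names(exprs)) {
    # the new column is added to env before the next expression is evaluated
    env[[nm]] <- eval(exprs[[nm]], env)
  }
  as.data.frame(mget(union(names(.data), names(exprs)), envir = env))
}
df <- data.frame(x = 1:5)
mutate_seq(df, x2 = x * x, x3 = x2 * x)  # x3 can refer to x2, created one step earlier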
Combined with a by operator (e.g. ddply()), these four functions allow you to express the majority of data manipulation operations. When you have a new problem, solving it becomes a matter of thinking about which operations you need to apply and in what order. The realm of possible actions has been shrunk to a manageable number.
Now that you understand how our version of subset works, go back and read the source code for subset.data.frame(), the base R version, which does a little more. Other functions that work similarly are with.default(), within.data.frame(), transform.data.frame(), and, in the plyr package, .(), arrange(), and summarise(). Look at the source code for these functions and see if you can figure out how they work.