- R was created by statisticians for statisticians (and other researchers)
- R contains multitudes; this can be good and bad
- PC/Linux: Tools > Global Options
- MacOS: RStudio > Preferences or Tools > Global Options
- General > Basic
- Don’t save or restore .RData
- Code > Editing
- Use native pipe operator
- Ctrl+Enter executes a single line (or a multi-line R statement)
- Code > Display
- Rainbow parentheses
- Appearance: Adjust font and syntax colors
- Pane Layout: Move IDE panes
By default, your view of your file system will be opaque. We want to make it transparent (e.g. you may have a local Desktop and a cloud Desktop folder).
Your local Desktop folder is in your Home directory.
- General
- New finder window shows: /Users/<home>
- Sidebar
- Favorites: /Users/<home>
- iCloud: iCloud Drive
- Locations: <computer name>, Cloud Storage
- Advanced
- Show all filename extensions
- Keep folders on top (all)
Your local Desktop folder is in your Home directory or Computer directory.
- File > Change folder and search options > View
- Files and Folders
- Show hidden files, folders, and drives
- Hide protected operating system files
- Uncheck Hide extensions for known file types
- Navigation Pane
- Show all folders
- View
- File name extensions
- Set working directory
- Test code snippets in the R console [REPL]
print("hello")
- Create an .R script in the working directory
print("hello")
- Run the script
- Keyboard shortcut
- Windows/Linux: Control-Enter
- MacOS: Command-Enter
- Run button
- Highlight lines and run them (same keyboard shortcut)
- Source the script to reduce console clutter and make contents available to other scripts
- Insert assignment arrow `<-`
- MacOS: Option + -
- Windows/Linux: Alt + -
- Good customization: Control + -
- Break execution if console hangs
- Windows: ESC
- MacOS/Linux: Control-c
- Clear console
- RStudio: C-l
- Emacs: C-c M-o / M-x comint-clear-buffer
- Comment/Uncomment code
- MacOS: Command-/
A whirlwind tour of R fundamentals
1 + 100
(3 + 5) * 2 # operator precedence
5 * (3 ^ 2) # powers
2/10000 # outputs 2e-04
2 * 10^(-4) # 2e-04 explicated
- Some functions need inputs (“arguments”)
getwd() # no argument required
sin(1)  # requires an argument
log(1)  # natural log
- RStudio has auto-completion: start typing a name (e.g., log) and completions appear
- Use `help()` to find out more about a function
help(exp)
exp(0.5) # e^(1/2)
- Basic comparisons
1 == 1
1 != 2
1 < 2
1 <= 1
- Use `all.equal()` to compare floating-point numbers
all.equal(3.0, 3.0)        # TRUE
all.equal(2.99, 3.0)       # 7 places: gives difference
all.equal(2.99999999, 3.0) # 8 places: TRUE
2.99999999 == 3.0          # 8 places: FALSE
- R uses the assignment arrow (`C-c C-=` in ESS)
# Assign a value to the variable name
x <- 0.025
- You can inspect a variable’s value in the Environment tab or by evaluating it in the console
# Evaluate the variable and echo its value to the console
x
- Variables can be re-used and re-assigned
log(x)
x <- 100
x <- x + 1
y <- x * 2
- Use a standard naming scheme for your variables
r.style.variable <- 10
python_style_variable <- 11
javaStyleVariable <- 12
Vectorize all the things! This makes idiomatic R very different from most programming languages, which use iteration (“for” loops) by default.
# Create a sequence 1 - 5
1:5
# Raise 2 to the Nth power for each element of the sequence
2^(1:5)
# Assign the resulting vector to a variable
v <- 1:5
2^v
ls() # List the objects in the environment
ls # Echo the contents of ls(), i.e. the code
rm(x) # Remove the x object
rm(list = ls()) # Remove all objects in environment
Note that parameter passing (`=`) is not the same as assignment (`<-`) in R!
data()
“Package” and “library” are roughly interchangeable.
- Install additional packages
install.packages("tidyverse")
## install.packages("rmarkdown")
- Activate a package for use
library("tidyverse")
See /scripts/curriculum.Rmd
project_name
├── project_name.Rproj
├── README.md
├── script_1.R
├── script_2.R
├── data
│   ├── processed
│   └── raw
├── results
└── temp
- File > New Project
- Create in existing Folder
- If you close RStudio and double-click the .Rproj file, RStudio will open to the project location and set the working directory.
help(write.csv)
?write.csv
- Description
- Usage
- Arguments
- Details
- Examples (highlight and run with `C-Enter`)
help("<-")
vignette("dplyr")
- RStudio autocomplete
- Fuzzy search
??set
- Browse by topic: https://cran.r-project.org/web/views/
There are no scalars in R; everything is a vector, even if it’s a vector of length 1.
v <- 1:5
length(v)
length(3.14)
There are 5 basic (vector) data types: double, integer, complex, logical and character.
typeof(v)
typeof(3.14)
typeof(1L)
typeof(1+1i)
typeof(TRUE)
typeof("banana")
- A vector must be all one type. If you mix types, R will perform type coercion.
See coercion rules in scripts/curriculum.Rmd
c(2, 6, '3') # coerced to character
c(0, TRUE)   # coerced to numeric
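The coercion rules follow a fixed hierarchy (logical → integer → double → character); a mixed vector is promoted to the most general type present:

```r
# Mixed types are promoted to the most general type present
typeof(c(TRUE, 1L))   # "integer"
typeof(c(1L, 2.5))    # "double"
typeof(c(2, 6, '3'))  # "character"
```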
- You can change vector types
# Create a character vector
chr_vector <- c('0', '2', '4')
str(chr_vector)
# Use it to create a numeric vector
num_vector <- as.numeric(chr_vector)
# Show the structure of the collection
str(num_vector)
- There are multiple ways to generate vectors
# Two options for generating sequences
1:10
seq(10)
# The seq() function is more flexible
series <- seq(1, 10, by = 0.1)
series
- Get information about a collection
# Don't print everything to the screen
length(series)
head(series)
tail(series, n = 2)
# You can add informative labels to most things in R
names(v) <- c("a", "b", "c", "d", "e")
v
str(v)
- Get an item by its position or label
v[1]
v["a"]
- Set an item by its position or label
v[1] <- 4
v
- (Optional) New vectors are empty by default
# Vectors are logical by default
vector1 <- vector(length = 3)
vector1
# You can specify the type of an empty vector
vector2 <- vector(mode = "character", length = 3)
vector2
str(vector2)
See /scripts/curriculum.Rmd
- A matrix is a 2-dimensional vector
# Create a matrix of zeros
mat1 <- matrix(0, ncol = 6, nrow = 3)
# Inspect it
class(mat1)
typeof(mat1)
str(mat1)
- Some operations act as if the matrix is a 1-D wrapped vector
mat2 <- matrix(1:25, nrow = 5, byrow = TRUE)
str(mat2)
length(mat2)
- Factors represent unique levels (e.g., experimental conditions)
coats <- c("tabby", "tortoise", "tortoise", "black", "tabby")
str(coats)
# The representation has 3 levels, some of which have multiple instances
categories <- factor(coats)
str(categories)
- R assumes that the first factor level is the baseline, so you may need to change your factor ordering so that it makes sense for your variables
## "control" should be the baseline, regardless of trial order
trials <- c("manipulation", "control", "control", "manipulation")
trial_factors <- factor(trials, levels = c("control", "manipulation"))
str(trial_factors)
- Create a data frame
coat <- c("calico", "black", "tabby")
weight <- c(2.1, 5.0, 3.2)
chases_bugs <- c(1, 0, 1)
cats <- data.frame(coat, weight, chases_bugs)
cats      # show contents of data frame
str(cats) # inspect structure of data frame
# Convert chases_bugs to logical vector
cats$chases_bugs <- as.logical(cats$chases_bugs)
str(cats)
- Write the data frame to a CSV and re-import it. You can use `read.delim()` for tab-delimited files, or `read.table()` for flexible, general-purpose input.
write.csv(x = cats, file = "../data/feline_data.csv", row.names = FALSE)
cats <- read.csv(file = "../data/feline_data.csv", stringsAsFactors = TRUE)
str(cats) # the chr column is now a factor column
- Access the column (vectors) of the data frame
cats$weight
cats$coat
- A vector can only hold one type. Therefore, in a data frame each data column (vector) has to be a single type.
typeof(cats$weight)
- Use data frame vectors in operations
cats$weight + 2
paste("My cat is", cats$coat)
# Operations have to be legal for the data type
cats$coat + 2 # error
# Operations are ephemeral unless their outputs are reassigned to the variable
cats$weight <- cats$weight + 1
- Data frames have column names; `names()` gets or sets them
names(cats)
names(cats)[2] <- "weight_kg"
cats
- Lists can contain anything
list1 <- list(1, "a", TRUE, 1+4i)
# Inspect each element of the list
list1[[1]]
list1[[2]]
list1[[3]]
list1[[4]]
If you use a single bracket `[]`, you get back a shorter section of the list, which is also a list. Use double brackets `[[]]` to drill down to the actual value.
- (Optional) This includes complex data structures
list2 <- list(title = "Numbers", numbers = 1:10, data = TRUE)
# Single brackets retrieve a slice of the list, containing the name:value pair
list2[2]
# Double brackets retrieve the value, i.e. the contents of the list item
list2[[2]]
- Data frames are lists of vectors and factors
typeof(cats)
- Some operations return lists, others return vectors (basically, are you getting the column with its label, or are you drilling down to the data?)
- Get list slices
# List slices
cats[1]      # list slice by index
cats["coat"] # list slice by name
cats[1, ]    # get data frame row by row number
- Get list contents (in this case, vectors)
# List contents (in this case, vectors)
cats[[1]]      # content by index
cats[["coat"]] # content by name
cats$coat      # content by name; shorthand for cats[["coat"]]
cats[, 1]      # content by index, across all rows
cats[1, 1]     # content by index, single row
- You can inspect all of these with `typeof()`
- Note that you can address data frames by rows and columns
See /scripts/curriculum.Rmd
age <- c(2, 3, 5)
cbind(cats, age)
cats # cats is unchanged
cats <- cbind(cats, age) # overwrite old cats
# Data frames enforce consistency
age <- c(2, 5)
cats <- cbind(cats, age)
newRow <- list("tortoiseshell", 3.3, TRUE, 9)
cats <- rbind(cats, newRow)
# Legal values added, illegal values are NA
cats
# Update the factor set so that "tortoiseshell" is a legal value
levels(cats$coat) <- c(levels(cats$coat), "tortoiseshell")
cats <- rbind(cats, list("tortoiseshell", 3.3, TRUE, 9))
cats
cats is now polluted with missing data
na.omit(cats)
cats
cats <- na.omit(cats)
gapminder <- read.csv("../data/gapminder_data.csv", stringsAsFactors = TRUE)
# Get an overview of the data frame
str(gapminder)
dim(gapminder)
# It's a list
length(gapminder)
colnames(gapminder)
# Look at the data
summary(gapminder$gdpPercap) # summary varies by data type
head(gapminder)
See /scripts/curriculum.Rmd
v <- 1:5
- Index selection
v[1]
v[1:3]     # index range
v[c(1, 3)] # selected indices
- (Optional) Index exclusion
v[-1]
v[-c(1, 3)]
letters[1:5]
names(v) <- letters[1:5]
- Character selection
v["a"]
v[names(v) %in% c("a", "c")]
- (Optional) Character exclusion
v[! names(v) %in% c("a", "c")]
m <- matrix(1:28, nrow = 7, byrow = TRUE)
# Matrices are just 2D vectors
m[2:4, 1:3]
m[c(1, 3, 5), c(2, 4)]
Single brackets get you subsets of the same type (`list -> list`, `vector -> vector`, etc.). Double brackets extract the underlying vector from a list or data frame.
# Create a new list and give it names
l <- replicate(5, sample(15), simplify = FALSE)
names(l) <- letters[1:5]
# You can extract one element
l[[1]]
l[["a"]]
# You can't extract multiple elements
l[[1:3]]
l[[names(l) %in% c("a", "c")]]
- Explicitly mask each item using TRUE or FALSE. This returns the reduced vector.
v[c(FALSE, TRUE, TRUE, FALSE, FALSE)]
- Evaluate the truth of each item, then keep the TRUE ones
# Use a criterion to generate a truth vector
v > 4
# Filter the original vector by the criterion
v[v > 4]
- Combining logical operations
v[v < 3 | v > 4]
# First three items
gapminder$country[1:3]
# All items in factor set
north_america <- c("Canada", "Mexico", "United States")
gapminder$country[gapminder$country %in% north_america]
Data frames have characteristics of both lists and matrices.
- Get first three rows
gapminder <- read.csv("../data/gapminder_data.csv", stringsAsFactors = TRUE)
# Get first three rows
gapminder[1:3, ]
- Rows and columns
gapminder[1:6, 1:3]
gapminder[1:6, c("country", "pop")]
- Data frames are lists, so one index gets you the columns
gapminder[1:3]
- Filter by contents
gapminder[gapminder$country == "Mexico", ]
north_america <- c("Canada", "Mexico", "United States")
gapminder[gapminder$country %in% north_america, ]
gapminder[gapminder$country %in% north_america & gapminder$year > 1999, ]
gapminder[gapminder$country %in% north_america & gapminder$year > 1999, c("country", "pop")]
See /scripts/curriculum.Rmd
- Look at Conditional template in curriculum.Rmd
- If
x <- 8
if (x >= 10) {
  print("x is greater than or equal to 10")
}
- Else
if (x >= 10) {
  print("x is greater than or equal to 10")
} else {
  print("x is less than 10")
}
- Else If
if (x >= 10) {
  print("x is greater than or equal to 10")
} else if (x > 5) {
  print("x is greater than 5, but less than 10")
} else {
  print("x is less than or equal to 5")
}
- Vectorize your tests
x <- 1:4
if (any(x < 2)) {
  print("Some x less than 2")
}
if (all(x < 2)) {
  print("All x less than 2")
}
Subsetting is frequently an alternative to if-else statements in R
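For example, a vectorized filter or `ifelse()` often replaces an element-by-element if-else loop (the values here are made up):

```r
scores <- c(3, 8, 12, 7)
# Element-by-element if-else in a loop...
labels <- character(length(scores))
for (i in seq_along(scores)) {
  if (scores[i] >= 10) {
    labels[i] <- "high"
  } else {
    labels[i] <- "low"
  }
}
# ...versus subsetting and the vectorized ifelse()
scores[scores >= 10]                # keep only the high scores
ifelse(scores >= 10, "high", "low") # label every element at once
```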
- Look at Iteration template in curriculum.Rmd
- Basic For loop
for (i in 1:10) {
  print(i)
}
- Nested For loop
for (i in 1:5) {
  for (j in letters[1:4]) {
    print(paste(i, j))
  }
}
- This is where we skip the example where we append things to the end of a data frame. For loops are slow; vectorized operations are fast (and idiomatic). Use for loops where they're the appropriate tool (e.g., loading files, cycling through whole data sets, etc.). We will see more of this in the section on reading and writing data.
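You can see the gap by timing the same computation both ways with `system.time()` (exact timings vary by machine):

```r
n <- 1e6
x <- seq_len(n)
# For loop: one assignment per element (slow)
system.time({
  out1 <- numeric(n)
  for (i in seq_len(n)) {
    out1[i] <- x[i] * 2
  }
})
# Vectorized: one operation over the whole vector (fast)
system.time(out2 <- x * 2)
identical(out1, out2)  # TRUE
```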
x <- 1:4
y <- 6:9
x + y
log(x)
# A more realistic example
gapminder$pop_millions <- gapminder$pop / 1e6
head(gapminder)
z <- 1:2
x + z
- Do the elements match a criterion?
x > 2
a <- (x > 2) # you can assign the output to a variable
# Evaluate a boolean vector
any(a)
all(a)
- Can you detect missing data?
nan_vec <- c(1, 3, NaN)
## Which elements are NaN?
is.nan(nan_vec)
## Which elements are not NaN?
!is.nan(nan_vec)
## Are any elements NaN?
any(is.nan(nan_vec))
## Are all elements NaN?
all(is.nan(nan_vec))
m <- matrix(1:12, nrow=3, ncol=4)
# Multiply each item by -1
m * -1
# Inner (dot) product of two vectors
1:4 %*% 1:4
# Matrix-wise multiplication
m2 <- matrix(1, nrow = 4, ncol = 1)
m2
m %*% m2
# Most functions operate on the whole vector or matrix
mean(m)
sum(m)
See /scripts/curriculum.Rmd
apply()
lets you apply an arbitrary function over a collection. This is an example of a higher-order function (map, apply, filter, reduce, fold, etc.) that can (and should) replace loops for most purposes. They are an intermediate case between vectorized operations (very fast) and for loops (very slow). Use them when you need to build a new collection and vectorized operations aren’t available.
m <- matrix(1:28, nrow = 7, byrow = TRUE)
apply(m, 1, mean)
apply(m, 2, mean)
apply(m, 1, sum)
apply(m, 2, sum)
lst <- list(title = "Numbers", numbers = 1:10, data = TRUE)
## length() returns the length of the whole list
length(lst)
## Use lapply() to get the length of the individual elements
lapply(lst, length)
`sapply()`: Apply a function polymorphically over a list, returning a vector, matrix, or array as appropriate
## Simplify and return a vector by default
sapply(lst, length)
## Optionally, return the original data type
sapply(lst, length, simplify = FALSE)
- Read a JSON file into a nested list
## Read JSON file into nested list
library("jsonlite")
books <- fromJSON("../data/books.json")
## View list structure
str(books)
- Extract all of the authors with `lapply()`. This requires us to define an anonymous function.
## Extract a single author
books[["bk110"]]$author
## Use lapply to extract all the authors
authors <- lapply(books, function(x) x$author)
## Returns list
str(authors)
- Extract all of the authors with `sapply()`
authors <- sapply(books, function(x) x$author) # Returns vector
str(authors)
- Method 1: Create a list of data frames, then bind them together into a single data frame
## This approach omits the top-level book id
df <- do.call(rbind, lapply(books, data.frame))
`lapply()` applies a given function to each element of a list, so there are several function calls. `do.call()` applies a given function to the list as a whole, so there is only one function call.
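A toy comparison makes the distinction concrete (the list and functions here are ours):

```r
args_list <- list("a", "b", "c")
# lapply(): one call per element, so three calls of toupper()
lapply(args_list, toupper)  # list("A", "B", "C")
# do.call(): one call, with the list supplying the arguments
do.call(paste0, args_list)  # "abc"
```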
- Method 2: Use the `rbindlist()` function from data.table
## This approach includes the top-level book id
df <- data.table::rbindlist(books, idcol = TRUE)
Functions let you encapsulate and re-use chunks of code. This has several benefits:
- Eliminates repetition in your code. This saves labor, but more importantly it reduces errors, and makes it easier for you to find and correct errors.
- Allows you to write more generic (i.e. flexible) code.
- Reduces cognitive overhead.
- Look at Function template in data/curriculum.Rmd
- Define a simple function
# Convert Fahrenheit to Celsius
f_to_celcius <- function(temp) {
  celcius <- (temp - 32) * (5/9)
  return(celcius)
}
- Call the function
f_to_celcius(32)
boiling <- f_to_celcius(212)
Define a second function and call the first function within the second.
f_to_kelvin <- function(temp) {
celcius <- f_to_celcius(temp)
kelvin <- celcius + 273.15
return(kelvin)
}
f_to_kelvin(212)
## Create a vector of temperatures
temps <- seq(from = 1, to = 101, by = 10)
# Vectorized calculation (fast)
f_to_kelvin(temps)
# Apply
sapply(temps, f_to_kelvin)
- Check whether inputs meet criteria before proceeding (this is `assert` in other languages).
f_to_celcius <- function(temp) {
  ## Check inputs
  stopifnot(is.numeric(temp), temp > -460)
  celcius <- (temp - 32) * (5/9)
  return(celcius)
}
f_to_celcius("a")  # error
f_to_celcius(-470) # error
- Fail with a custom error if criterion not met
f_to_celcius <- function(temp) {
  if (!is.numeric(temp)) {
    stop("temp must be a numeric vector")
  }
  celcius <- (temp - 32) * (5/9)
  return(celcius)
}
## Prerequisites
gapminder <- read.csv("../data/gapminder_data.csv", stringsAsFactors = TRUE)
north_america <- c("Canada", "Mexico", "United States")
- Calculate the total GDP for each entry in the data set
gapminder <- read.csv("../data/gapminder_data.csv", stringsAsFactors = TRUE)
gdp <- gapminder$pop * gapminder$gdpPercap
- Write a function to perform a total GDP calculation on a filtered subset of your data.
calcGDP <- function(df, year = NULL, country = NULL) {
  if (!is.null(year)) {
    df <- df[df$year %in% year, ]
  }
  if (!is.null(country)) {
    df <- df[df$country %in% country, ]
  }
  gdp <- df$pop * df$gdpPercap
  new_df <- cbind(df, gdp = gdp)
  return(new_df)
}
- Mutating `df` inside the function doesn't affect the global `gapminder` data frame (because of pass-by-value and scope).
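A minimal sketch of this behavior (the names here are ours):

```r
df <- data.frame(x = 1:3)
add_doubled <- function(df) {
  df$y <- df$x * 2 # modifies the function's local copy only
  return(df)
}
new_df <- add_doubled(df)
names(df)     # still just "x": the global df is untouched
names(new_df) # "x" "y"
```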
See data/curriculum.Rmd
- Preliminaries
if (!dir.exists("../processed")) {
  dir.create("../processed")
}
north_america <- c("Canada", "Mexico", "United States")
- Version 1: Use the `calcGDP` function
for (year in unique(gapminder$year)) {
  df <- calcGDP(gapminder, year = year, country = north_america)
  ## Generate a file name. This will fail if "processed" doesn't exist
  fname <- paste("../processed/north_america_", as.character(year), ".csv", sep = "")
  ## Write the file
  write.csv(x = df, file = fname, row.names = FALSE)
}
- Version 2: Bypass the `calcGDP` function
for (year in unique(gapminder$year)) {
  df <- gapminder[gapminder$year == year, ]
  df <- df[df$country %in% north_america, ]
  fname <- paste("../processed/north_america_", as.character(year), ".csv", sep = "")
  write.csv(x = df, file = fname, row.names = FALSE)
}
## Get matching files from the `processed` subdirectory
dir(path = "../processed", pattern = "north_america_[1-9]*.csv")
- Read each file into a data frame and add it to a list
## Create an empty list
df_list <- list()
## Get the locations of the matching files
file_names <- dir(path = "../processed", pattern = "north_america_[1-9]*.csv")
file_paths <- file.path("../processed", file_names)
## Key each data frame by its file name, so df_list[["north_america_1952.csv"]] works
for (f in file_names) {
  df_list[[f]] <- read.csv(file.path("../processed", f), stringsAsFactors = TRUE)
}
- Access the list items to view the individual data frames
length(df_list)
names(df_list)
lapply(df_list, length)
df_list[["north_america_1952.csv"]]
- Instead of a for loop that handles each file individually, use a single vectorized function.
df_list <- lapply(file_paths, read.csv, stringsAsFactors = TRUE)
## The resulting list does not have names set by default
names(df_list)
## You can still access by index position
df_list[[2]]
- Add names manually
names(df_list) <- file_names
df_list$north_america_1952.csv
- (Optional) Automatically set names for the output list. This example sets each name to the complete path name (e.g., `"../processed/north_america_1952.csv"`).
df_list <- sapply(file_paths, read.csv, simplify = FALSE, USE.NAMES = TRUE)
- Method 1: Create a list of data frames, then bind them together into a single data frame
df <- do.call(rbind, df_list)
`lapply()` applies a given function to each element of a list, so there are several function calls. `do.call()` applies a given function to the list as a whole, so there is only one function call.
- (Optional) Method 2: Use the `rbindlist()` function from data.table. This can be faster for large data sets. It also gives you the option of preserving the list names (in this case, the source file names) as a new column in the new data frame.
df_list <- sapply(file.path("../processed", file_names), read.csv,
                  simplify = FALSE, USE.NAMES = TRUE)
df <- data.table::rbindlist(df_list, idcol = TRUE)
library("dplyr")
- Explain Tidyverse briefly: https://www.tidyverse.org/packages/
- (Optional) Demo unix pipes with
history | grep
- Explain tibbles briefly
- dplyr allows you to treat data frames like relational database tables; i.e. as sets
`select()` provides a mini-language for selecting data frame variables
df <- select(gapminder, year, country, gdpPercap)
str(df)
`select()` understands negation (and many other intuitive operators)
df2 <- select(gapminder, -continent)
str(df2)
- You can link multiple operations using pipes. This will be more intuitive once we see pipes combined with `filter()`
df <- gapminder %>% select(year, country, gdpPercap)
## You can use the native pipe, which has a few limitations:
## df <- gapminder |> select(year, country, gdpPercap)
- Filter by continent
df_europe <- gapminder %>%
  filter(continent == "Europe") %>%
  select(year, country, gdpPercap)
str(df_europe)
- Filter by continent and year
europe_2007 <- gapminder %>%
  filter(continent == "Europe", year == 2007) %>%
  select(country, lifeExp)
str(europe_2007)
See data/curriculum.Rmd
- Group data by a data frame variable
grouped_df <- gapminder %>% group_by(continent)
## This produces a tibble
str(grouped_df)
- The grouped data frame contains metadata (i.e. bookkeeping) that tracks the group membership of each row. You can inspect this metadata:
grouped_df %>% tally()
grouped_df %>% group_keys()
grouped_df %>% group_vars()
## These produce a lot of output:
grouped_df %>% group_indices()
grouped_df %>% group_rows()
- More information about grouped data frames: https://dplyr.tidyverse.org/articles/grouping.html
- Calculate mean gdp per capita by continent
grouped_df %>% summarise(mean_gdpPercap = mean(gdpPercap))
- (Optional) Using pipes allows you to do ad hoc reporting without creating intermediate variables
gapminder %>%
  group_by(continent) %>%
  summarise(mean_gdpPercap = mean(gdpPercap))
- Group data by multiple variables
df <- gapminder %>%
  group_by(continent, year) %>%
  summarise(mean_gdpPercap = mean(gdpPercap))
- Create multiple data summaries
df <- gapminder %>%
  group_by(continent, year) %>%
  summarise(mean_gdp = mean(gdpPercap),
            sd_gdp = sd(gdpPercap),
            mean_pop = mean(pop),
            sd_pop = sd(pop))
`count()` lets you get an ad hoc count of any variable
gapminder %>%
  filter(year == 2002) %>%
  count(continent, sort = TRUE)
`n()` gives the number of observations in a group
## Get the standard error of life expectancy by continent
gapminder %>%
  group_by(continent) %>%
  summarise(se_le = sd(lifeExp) / sqrt(n()))
`mutate()` creates a new variable within your pipeline
## Total GDP and population by continent and year
df <- gapminder %>%
mutate(gdp_billion = gdpPercap * pop / 10^9) %>%
group_by(continent, year) %>%
summarise(mean_gdp = mean(gdp_billion),
sd_gdp = sd(gdp_billion),
mean_pop = mean(pop),
sd_pop = sd(pop))
- Perform previous calculation, but only in cases in which the life expectancy is over 25
df <- gapminder %>%
  mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>%
  group_by(continent, year) %>%
  summarise(mean_gdp = mean(gdp_billion),
            sd_gdp = sd(gdp_billion),
            mean_pop = mean(pop),
            sd_pop = sd(pop))
- (Optional) Predict future GDP per capita for countries with higher life expectancies
df <- gapminder %>%
  mutate(gdp_expected = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap),
            mean_gdpPercap_expected = mean(gdp_expected))
gapminder %>%
filter(year == 2002) %>%
group_by(continent) %>%
sample_n(2) %>%
summarize(mean_lifeExp = mean(lifeExp), country = country) %>%
arrange(desc(mean_lifeExp))
- Long format: All rows are unique observations (ideally)
- each column is a variable
- each row is an observation
- Wide format: Rows contain multiple observations
- Repeated measures
- Multiple variables
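A toy data set in both shapes (the column names here are ours):

```r
# Wide: one row per subject, one column per repeated measurement
wide <- data.frame(subject    = c("a", "b"),
                   score_2020 = c(10, 20),
                   score_2021 = c(11, 21))
# Long: one row per subject-year observation
long <- data.frame(subject = c("a", "a", "b", "b"),
                   year    = c(2020, 2021, 2020, 2021),
                   score   = c(10, 11, 20, 21))
```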
library("tidyr")
library("dplyr")
str(gapminder)
- 3 ID variables: continent, country, year
- 3 Observation variables: pop, lifeExp, gdpPercap
- Load wide gapminder data
gap_wide <- read.csv("../data/gapminder_wide.csv", stringsAsFactors = FALSE)
str(gap_wide)
- Group comparable columns into a single variable. Here we group all of the “pop” columns, all of the “lifeExp” columns, and all of the “gdpPercap” columns.
gap_long <- gap_wide %>%
  pivot_longer(
    cols = c(starts_with('pop'), starts_with('lifeExp'), starts_with('gdpPercap')),
    names_to = "obstype_year",
    values_to = "obs_values"
  )
str(gap_long)
head(gap_long, n = 20)
- Original column headers become keys
- Original column values become values
- This pushes all values into a single column, which is unintuitive. We will generate the intermediate format later.
- (Optional) Same pivot operation as (2), specifying the columns to be omitted rather than included.
gap_long <- gap_wide %>%
  pivot_longer(
    cols = c(-continent, -country),
    names_to = "obstype_year",
    values_to = "obs_values"
  )
str(gap_long)
- Split compound variables into individual variables
gap_long <- gap_long %>%
  separate(obstype_year, into = c('obs_type', 'year'), sep = "_")
gap_long$year <- as.integer(gap_long$year)
- Recreate the original gapminder data frame (as a tibble)
## Read in the original data without factors for comparison purposes
gapminder <- read.csv("../data/gapminder_data.csv", stringsAsFactors = FALSE)
gap_normal <- gap_long %>%
  pivot_wider(names_from = obs_type, values_from = obs_values)
str(gap_normal)
str(gapminder)
- Rearrange the column order of `gap_normal` so that it matches `gapminder`
gap_normal <- gap_normal[, names(gapminder)]
- Check whether the data frames are equivalent (they aren’t yet)
all.equal(gap_normal, gapminder)
head(gap_normal)
head(gapminder)
- Change the sort order of `gap_normal` so that it matches `gapminder`
gap_normal <- gap_normal %>% arrange(country, year)
all.equal(gap_normal, gapminder)
- Create variable labels for wide columns. In this case, the new variables are all combinations of metric (pop, lifeExp, or gdpPercap) and year. Effectively we are squishing many columns together.
help(unite)
df_temp <- gap_long %>%
  ## unite(ID_var, continent, country, sep = "_") %>%
  unite(var_names, obs_type, year, sep = "_")
str(df_temp)
head(df_temp, n = 20)
- Pivot to wide format, distributing data into columns for each unique label
gap_wide_new <- gap_long %>%
  ## unite(ID_var, continent, country, sep = "_") %>%
  unite(var_names, obs_type, year, sep = "_") %>%
  pivot_wider(names_from = var_names, values_from = obs_values)
str(gap_wide_new)
- Sort columns alphabetically by variable name, then check for equality. You can move a single column to a different position with `relocate()`
gap_wide_new <- gap_wide_new[, order(colnames(gap_wide_new))]
all.equal(gap_wide, gap_wide_new)
Fast, user-friendly file imports.
Real string processing for R.
Functional programming for the Tidyverse. The map family of functions replaces the apply family for most use cases. Map functions are strongly typed. For example, you can use `purrr::map_chr()` to extract nested data from a list:
## View the relevant map function
library("purrr")
library("jsonlite")
help(map_chr)
books <- fromJSON("../data/books.json")
## Returns vector
authors <- map_chr(books, ~.x$author)
- The `~` operator in purrr creates an anonymous function that applies to all the elements in the `.x` collection.
- Best overview is in the `as_mapper()` documentation: https://purrr.tidyverse.org/reference/as_mapper.html
- https://stackoverflow.com/a/53160041
- https://stackoverflow.com/a/62488532
- https://stackoverflow.com/a/44834671
- Additional references
- https://jozef.io/r006-merge/#alternatives-to-base-r
- https://dplyr.tidyverse.org/reference/mutate-joins.html
- Understand your data
- Quality control
- Support the selection of statistical procedures
- Evaluate whether data conform with the assumptions of the statistical tests (e.g., normality)
- central tendency measures: mean, median, mode
- variation/dispersion measures: range, range width, variance, standard deviation, variation coefficient
- data distribution: quantiles, inter-quantile ranges, boxplots, histograms.
- relationship between variables: scatterplots, correlations, linear models
data("anscombe")
print(anscombe)
- Central tendency measures
mean(anscombe$x1)
apply(anscombe[, 1:4], 2, mean)
apply(anscombe[, 5:8], 2, mean)
apply(anscombe, 2, var)
- Correlations
cor(anscombe$x1, anscombe$y1)
cor(anscombe$x2, anscombe$y2)
cor(anscombe$x3, anscombe$y3)
cor(anscombe$x4, anscombe$y4)
- Linear regression parameters
m1 <- lm(anscombe$y1 ~ anscombe$x1)
m2 <- lm(anscombe$y2 ~ anscombe$x2)
m3 <- lm(anscombe$y3 ~ anscombe$x3)
m4 <- lm(anscombe$y4 ~ anscombe$x4)
coef(m1)
coef(m2)
coef(m3)
coef(m4)
- Plot the data and regression lines
mlist <- list(m1, m2, m3, m4)
lapply(mlist, coef)
## Plots
plot(anscombe$y1 ~ anscombe$x1)
abline(mlist[[1]])
plot(anscombe$y2 ~ anscombe$x2)
abline(mlist[[2]])
plot(anscombe$y3 ~ anscombe$x3)
abline(mlist[[3]])
plot(anscombe$y4 ~ anscombe$x4)
abline(mlist[[4]])
ggplot2 separates the data from the aesthetics and allows layers of information to be added sequentially with `+`
ggplot(data = <data>,
mapping = aes(<mappings>)) +
geom_xxx()
- data
- mappings: the specific variables (x, y, z, group…)
- geom_xxx(): functions for plotting options `geom_point()`, `geom_line()`
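A minimal sketch, assuming ggplot2 is installed and the gapminder data frame from earlier sections is loaded:

```r
library("ggplot2")
# Data, then mappings, then a point layer added with `+`
ggplot(data = gapminder,
       mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
```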
- wesanderson
- latticeExtra
- plotrix
- ggplot2
- R for Reproducible Scientific Analysis: https://swcarpentry.github.io/r-novice-gapminder/
- Andrea Sánchez-Tapia’s workshop: https://github.com/AndreaSanchezTapia/UCMerced_R
- Instructor notes for “R for Reproducible Scientific Analysis”: https://swcarpentry.github.io/r-novice-gapminder/guide/
- R for Ecology: https://datacarpentry.org/R-ecology-lesson/04-visualization-ggplot2.html
- R Project documentation: https://cran.r-project.org/manuals.html
- CRAN task views: https://cran.r-project.org/web/views/
- R Cookbook: http://www.cookbook-r.com
- RStudio cheat sheets: https://www.rstudio.com/resources/cheatsheets/
- Matrix algebra operations in R: https://www.statmethods.net/advstats/matrix.html
- RStudio keyboard shortcuts: https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts
- RStudio shortcuts and tips: https://appsilon.com/rstudio-shortcuts-and-tips/
- Why `typeof()` and `class()` give different outputs: https://stackoverflow.com/a/8857411
- How to get function code from the different object systems: https://stackoverflow.com/questions/19226816/how-can-i-view-the-source-code-for-a-function
- Various approaches to contrast coding: https://stats.oarc.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/
If you tell R that a factor is ordered, it defaults to orthogonal polynomial contrasts: it assumes you want to check for linear, quadratic, and cubic trends. If you tell R that a factor is NOT ordered, it defaults to treatment contrasts: it compares every level to a reference level. This probably doesn't make sense for lots of psych data. So if I say income is ordered, R calculates linear, quadratic, etc. trends for income, which is not only not what I want, but is inappropriate unless the groups are evenly spaced. Treatment contrasts test whether each level differs significantly from a reference level (e.g., the highest income group).
So if you want first-year stats output in a design with more than 2 levels in the factor, put this at the top of the R code:
options(contrasts = c("contr.sum","contr.poly"))
`contr.sum` is R's name for deviation contrasts, which you may recall as contrasts like -1, 0, 1.
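You can inspect the contrast matrices directly to see the difference for a 3-level factor:

```r
# Deviation (sum-to-zero) contrasts: -1/0/1 coding
contr.sum(3)
# Treatment contrasts (default for unordered factors): each level vs. a reference
contr.treatment(3)
# Orthogonal polynomial contrasts (default for ordered factors): linear and quadratic
contr.poly(3)
```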
- Gapminder data:
- JSON derived from Microsoft sample XML file: https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms762271(v=vs.85)