This repository contains the course materials for Data Academy’s Introduction to R course.
- Clone or download the repository onto your computer. Then open `data-academy-intro-to-r.Rproj` in RStudio (you should be able to double-click the file).
- Open `R/install_packages.R` and click ‘Source’. This should install the packages required for the workshop.
- During the workshop, open `R/script.R` and follow along, executing each line of code.
- After the workshop, complete the lab exercises below in a new R script.
In a new script, load the tidyverse and complete the exercises below.
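As a warm-up, here is a minimal sketch of the verb-plus-pipe pattern the exercises rely on. The `toy_flights` table and all of its values are invented for illustration and are not the workshop's `flights` data. `library(dplyr)` is used only to keep the example small; it is one of the packages attached by `library(tidyverse)`.

```r
library(dplyr)  # also attached by library(tidyverse)

# Hypothetical toy data, invented for this example (not data/flights.rds)
toy_flights <- tibble(
  carrier   = c("UA", "AA", "DL", "UA", "B6"),
  dest      = c("IAH", "HOU", "ATL", "ORD", "IAH"),
  dep_delay = c(5, NA, 30, 120, -3),
  air_time  = c(200, 210, 110, 120, 205)
)

toy_result <- toy_flights %>%
  # keep rows for two destinations where the delay is known
  filter(dest %in% c("IAH", "HOU"), !is.na(dep_delay)) %>%
  # add a derived column (minutes converted to hours)
  mutate(air_hours = air_time / 60) %>%
  # largest delay first
  arrange(desc(dep_delay)) %>%
  # keep only the columns of interest
  select(carrier, dest, dep_delay, air_hours)

toy_result

# count() tallies rows per value; sort = TRUE puts the largest count first
toy_counts <- toy_flights %>% count(dest, sort = TRUE)
toy_counts
```

Each verb takes a data frame and returns a data frame, which is why the steps can be chained with `%>%`.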
- Read `data/flights.rds` into R, naming the output `flights`.
- From `flights`, select the year, month, and day columns.
- From `flights`, select all columns except carrier.
- Sort `flights` by carrier.
- Sort `flights` by departure delay in descending order.
- Rename the tailnum column to tail_num.
- Find the distinct origins.
- Find the distinct combinations of origins and destinations.
- Filter `flights` for the first day of January or February.
- Find all flights that meet these conditions:
  - Had an arrival delay of two or more hours
  - Flew to Houston (IAH or HOU)
  - Were operated by United (UA), American (AA), or Delta (DL)
  - Departed in summer (July, August, and September)
- Count the flight destinations by origin.
- Create a new column ‘speed’ that is distance divided by air time, then multiplied by 60 (that is, distance / air_time * 60).
- Create a new column ‘flight_hours’ that is air_time divided by 60.
- In a pipeline (consecutive pipes), filter `flights` to rows where neither `arr_delay` nor `tailnum` is NA. Then count the destinations.
- In a pipeline (consecutive pipes), filter `flights` where the destination is “IAH”, then find the average arrival delay, grouped by year, month, and day.
- Which carrier has the highest average delays, both arrival and departure? Calculate both within `summarize()`.
- During which month is the average departure delay highest? During which is it lowest?
- Read `data/airports.rds` into R, naming the output `airports`.
- Filter `airports` where ‘Intl’ is in the name, then rename the `faa` column to `dest`. Save the output to a data frame called `intl_airports`.
- In a single pipeline, select `sched_arr_time`, `arr_delay`, `dest` from `flights`, inner join `intl_airports` using `dest` as the ‘key’, and then count the airport names. Sort the output. Then use a left join instead of an inner join. What’s different? And why is it different?
- A pattern we haven’t seen yet is `group_by()` followed by `mutate()`. Compare the two outputs below:

      penguins %>%
        group_by(island) %>%
        summarize(mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE))

      penguins %>%
        group_by(island) %>%
        mutate(mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE))

  How are they different? How does the output change if we add `%>% ungroup()` to the end? Why might we want to add `ungroup()` after we’ve completed a grouping operation (it’s usually a good decision)?
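One way to explore these questions is on a table small enough to read at a glance. The sketch below is only an illustration: `toy_penguins` and its values are invented (this is not the `penguins` data), and it uses dplyr's `group_vars()` to report which grouping variables a result still carries.

```r
library(dplyr)  # also attached by library(tidyverse)

# Hypothetical toy data, invented for this example
toy_penguins <- tibble(
  island  = c("A", "A", "B", "B", "B"),
  flipper = c(180, 190, 200, NA, 210)
)

by_island <- toy_penguins %>% group_by(island)

# Same calculation, two different verbs
summarized <- by_island %>% summarize(mean_flipper = mean(flipper, na.rm = TRUE))
mutated    <- by_island %>% mutate(mean_flipper = mean(flipper, na.rm = TRUE))

dim(summarized)  # rows and columns after summarize()
dim(mutated)     # rows and columns after mutate()

group_vars(mutated)                # grouping variables still attached
group_vars(mutated %>% ungroup())  # grouping variables after ungroup()
```

Running the four inspection lines and comparing their results is a quick way to check your answers.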