SFOEWD/data-academy-intro-to-r: Materials for Data Academy's Introduction to R

This repository contains the course materials for Data Academy’s Introduction to R course.

Setup

Clone or download the repository to your computer. Then open data-academy-intro-to-r.Rproj in RStudio (you should be able to just double-click the file).
Open R/install_packages.R and click ‘Source’. This should install the required packages for the workshop.
During the workshop, open R/script.R and follow along, executing each line of code.
After the workshop, complete the lab exercises below in a new R script.

Lab

In a new script, load the tidyverse and complete the exercises below.
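
As a starting point, a lab script could open along the lines of the sketch below. It assumes the tidyverse is already installed (for example via R/install_packages.R). The data frame `toy`, its columns, and the temporary file path are made up so the snippet runs anywhere; they are not part of the course data. The sketch shows the general pattern of loading packages, reading an .rds file with base R's readRDS(), and piping the result into a dplyr verb.

```r
# Load the tidyverse (dplyr, tibble, and friends)
library(tidyverse)

# Hypothetical toy data, written to a temporary .rds file for illustration
toy <- tibble(id = 1:3, value = c(10, 20, 30))
toy_path <- tempfile(fileext = ".rds")
saveRDS(toy, toy_path)

# Read the .rds file back in and give the output a name
toy_from_disk <- readRDS(toy_path)

# Pipe the data frame into a dplyr verb
big_values <- toy_from_disk %>%
  filter(value > 10)

big_values
```

For the lab itself, the same readRDS() call would point at a file under data/ instead of a temporary path.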

Read data/flights.rds into R, naming the output flights.
From flights, select the year, month, and day columns.
From flights, select all columns except carrier.
Sort flights by carrier.
Sort flights by departure delay in descending order.
Rename the tailnum column to tail_num.
Find the distinct origins.
Find the distinct combinations of origins and destinations.
Filter flights for the first day of January or February.
Find all flights that meet these conditions:

Had an arrival delay of two or more hours
Flew to Houston (IAH or HOU)
Were operated by United (UA), American (AA), or Delta (DL)
Departed in summer (July, August, and September)

Count the flight destinations by origin.
Create a new column ‘speed’ that is distance divided by air_time, then multiplied by 60 (that is, distance / air_time * 60).
Create a new column ‘flight_hours’ that is the air_time divided by 60.
In a pipeline (consecutive pipes), filter flights to the rows where neither arr_delay nor tailnum is NA. Then count the destinations.
In a pipeline (consecutive pipes), filter flights where the destination is “IAH”, then find the average arrival delay, grouped by year, month, and day.
Which carrier has the highest average delays, both arrival and departure? Calculate both within summarize().
Which month has the highest average departure delay? The lowest?
Read data/airports.rds into R, naming the output airports.
Filter airports where ‘Intl’ is in the name, then rename the faa column to dest. Save the output to a data frame called intl_airports.
In a single pipeline, select sched_arr_time, arr_delay, dest from flights, inner join intl_airports using dest as the ‘key’, and then count the airport names. Sort the output. Then use a left join instead of an inner join. What’s different? And why is it different?
A pattern we haven’t seen yet is group_by() followed by mutate(). Compare the two outputs below:

penguins %>% 
  group_by(island) %>% 
  summarize(mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE))

penguins %>% 
  group_by(island) %>% 
  mutate(mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE))

How are they different? How does the output change if we add %>% ungroup() to the end? Why might we want to add ungroup() after we’ve completed a grouping operation (it’s usually a good decision)?
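
To see the shape of the group_by() followed by mutate() pattern on something smaller than penguins, here is a minimal sketch using a made-up tibble. The names `scores`, `team`, and `points` are hypothetical and exist only for this illustration. It is meant as a way to check your own reasoning, not as a definitive answer to the questions above.

```r
library(tidyverse)

# Hypothetical toy data: two teams, four rows
scores <- tibble(
  team   = c("a", "a", "b", "b"),
  points = c(1, 3, 10, 20)
)

# summarize() collapses each group to a single row
by_team <- scores %>%
  group_by(team) %>%
  summarize(mean_points = mean(points))

# mutate() keeps every row and repeats the group mean on each one;
# the result is still grouped by team
with_mean <- scores %>%
  group_by(team) %>%
  mutate(mean_points = mean(points))

# ungroup() drops the grouping so later verbs act on the whole table
ungrouped <- with_mean %>%
  ungroup()

nrow(by_team)             # 2: one row per team
nrow(with_mean)           # 4: same number of rows as the input
is_grouped_df(with_mean)  # TRUE
is_grouped_df(ungrouped)  # FALSE
```

Printing `with_mean` and `ungrouped` side by side shows the same values; the difference is the grouping that the first one still carries, which any later verb would silently respect.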

