# Preface {-}
\vspace{0.1in}
<center>
`r include_image("images/logos/Rlogo.png", html_opts = "height=100px", latex_opts = "height=10%")` \hfill       `r include_image("images/logos/RStudio-Logo-Blue-Gradient.png", html_opts = "height=100px", latex_opts = "height=10%")`
</center>
**Help! I'm completely new to coding and I need to learn R and RStudio! What do I do?**
\vspace{0.1in}
If you're asking yourself this question, then you've come to the right place! Start with the "Introduction for students" section.
* *Are you an instructor hoping to use this book in your courses? We recommend reading the "Introduction for students" section first. Then, read the "Introduction for instructors" section for more information on how to teach with this book.*
* *Are you looking to connect with and contribute to* ModernDive*? Then, read the "Connect and contribute" section for information on how.*
* *Are you curious about the publishing of this book? Then, read the "About this book" section for more information on the open-source technology, in particular R Markdown and the bookdown package.*
```{r results="asis", echo=FALSE, purl=FALSE}
if (is_html_output()) {
cat(glue::glue('This is version {version} of *ModernDive* published on {date}. For previous versions of *ModernDive*, see the "About this book" section below.'))
}
```
## Introduction for students {-}
This book assumes no prerequisites: no algebra, no calculus, and no prior programming/coding experience. This is intended to be a gentle introduction to the practice of analyzing data and answering questions using data the way data scientists, statisticians, data journalists, and other researchers would.
We present a map of your upcoming journey in Figure \@ref(fig:moderndive-figure).
(ref:flowchart) *ModernDive* flowchart.
```{r moderndive-figure, fig.align="center", fig.cap="(ref:flowchart)", echo=FALSE, purl=FALSE}
include_graphics("images/flowcharts/flowchart/flowchart.002.png")
```
You'll first get started with data in Chapter \@ref(getting-started) where you'll learn about the difference between R and RStudio, start coding in R, install and load your first R packages, and explore your first dataset: all domestic departure `flights` from a New York City airport in 2013. Then you'll cover the following three portions of this book (Parts 2 and 4 are combined into a single portion):
1. Data science with `tidyverse`. You'll assemble your data science toolbox using `tidyverse` packages. In particular, you'll:
    + Ch.\@ref(viz): Visualize data using the `ggplot2` package.
    + Ch.\@ref(wrangling): Wrangle data using the `dplyr` package.
    + Ch.\@ref(tidy): Learn about the concept of "tidy" data as a standardized data input and output format for all packages in the `tidyverse`. Furthermore, you'll learn how to import spreadsheet files into R using the `readr` package.
1. Data modeling with `moderndive`. Using these data science tools and helper functions from the `moderndive` package, you'll fit your first data models. In particular, you'll:
    + Ch.\@ref(regression): Discover basic regression models with only one explanatory variable.
    + Ch.\@ref(multiple-regression): Examine multiple regression models with more than one explanatory variable.
1. Statistical inference with `infer`. Once again using your newly acquired data science tools, you'll unpack statistical inference using the `infer` package. In particular, you'll:
    + Ch.\@ref(sampling): Learn about the role that sampling variability plays in statistical inference and the role that sample size plays in this sampling variability.
    + Ch.\@ref(confidence-intervals): Construct confidence intervals using bootstrapping.
    + Ch.\@ref(hypothesis-testing): Conduct hypothesis tests using permutation.
1. Data modeling with `moderndive` (revisited). Armed with your understanding of statistical inference, you'll revisit and review the models you'll construct in Ch.\@ref(regression) and Ch.\@ref(multiple-regression). In particular, you'll:
    + Ch.\@ref(inference-for-regression): Interpret confidence intervals and hypothesis tests in a regression setting.
We'll end with a discussion on what it means to "tell your story with data" in Chapter \@ref(thinking-with-data) by presenting example case studies.^[Note that you'll see different versions of the word "ModernDive" in this book:
(1) `moderndive` refers to the R package.
(2) *ModernDive* is an abbreviated version of *Statistical Inference via Data Science: A ModernDive into R and the Tidyverse*. It's essentially a nickname we gave the book.
(3) ModernDive (without italics) corresponds to both the book and the corresponding R package together as an entity.]
### What we hope you will learn from this book {-}
We hope that by the end of this book, you'll have learned how to:
1. Use R and the `tidyverse` suite of R *packages* for data science.
1. Fit your first *models* to data, using a method known as *linear regression*.
1. Perform *statistical inference* using *sampling*, *confidence intervals*, and *hypothesis tests*.
1. *Tell your story with data* using these tools.
What do we mean by data stories? We mean any analysis involving data that engages the reader in answering questions with careful visuals and thoughtful discussion. Further discussions on data stories can be found in the blog post ["Tell a Meaningful Story With Data."](https://www.thinkwithgoogle.com/marketing-resources/data-measurement/tell-meaningful-stories-with-data/)
Over the course of this book, you will develop your "data science toolbox," equipping yourself with tools such as data visualization, data formatting, data wrangling, and data modeling using regression.
In particular, this book will lean heavily on data visualization. In today's world, we are bombarded with graphics that attempt to convey ideas. We will explore what makes a good graphic and which standard approaches are used to convey relationships within data. In general, we'll use visualization as a way of building almost all of the ideas in this book.
To impart the statistical lessons of this book, we have intentionally minimized the number of mathematical formulas used. Instead, you'll develop a conceptual understanding of statistics using data visualization and computer simulations. We hope this is a more intuitive experience than the way statistics has traditionally been taught and how it is commonly perceived.
Finally, you'll learn the importance of literate programming. \index{literate programming} By this we mean you'll learn how to write code that is useful not just for a computer to execute, but also for readers to understand exactly what your analysis is doing and how you did it. This is part of a greater effort to encourage reproducible research (see the "Reproducible research" subsection in this Preface for more details). Hal Abelson \index{Abelson, Hal} coined the phrase that we will follow throughout this book:
> Programs must be written for people to read, and only incidentally for machines to execute.
We understand that there may be challenging moments as you learn to program. Both of us continue to struggle and find ourselves often using web searches to find answers and reaching out to colleagues for help. In the long run though, we all can solve problems faster and more elegantly via programming. We wrote this book as our way to help you get started and you should know that there is a huge community of R users who are happy to help everyone along as well. This community exists in particular on the internet on various forums and websites such as [stackoverflow.com](https://stackoverflow.com/).
### Data/science pipeline {-}
You may think of statistics as just being a bunch of numbers. We commonly hear the word "statistician" when listening to broadcasts of sporting events. Statistics (in particular, data analysis), in addition to describing numbers such as baseball batting averages, plays a vital role in all of the sciences. \index{statistics}
You'll commonly hear the phrase "statistically significant" thrown around in the media. You'll see articles that say, "Science now shows that chocolate is good for you." Underpinning these claims is data analysis. \index{data analysis} By the end of this book, you'll be able to better understand whether these claims should be trusted or whether we should be wary. Inside data analysis are many sub-fields that we will discuss throughout this book (though not necessarily in this order):
- data collection
- data wrangling
- data visualization
- data modeling
- inference
- correlation and regression
- interpretation of results
- data communication/storytelling
These sub-fields are summarized in what Grolemund \index{Grolemund, Garrett} and Wickham \index{Wickham, Hadley} have previously termed the ["data/science pipeline"](http://r4ds.had.co.nz/explore-intro.html) in Figure \@ref(fig:pipeline-figure).
```{r pipeline-figure, fig.align="center", fig.cap="Data/science pipeline.", echo=FALSE, purl=FALSE, out.height="100%", out.width="100%"}
include_graphics("images/r4ds/data_science_pipeline.png")
```
We will begin by digging into the grey **Understand** portion of the cycle with data visualization, continue with a discussion of what is meant by tidy data and data wrangling, and then conclude by interpreting and discussing the results of our models via **Communication**. These steps are vital to any statistical analysis. But, why should you care about statistics?
There's a reason that many fields require a statistics course. Scientific knowledge grows through an understanding of statistical significance and data analysis. You needn't be intimidated by statistics. It's not the beast that it used to be and, once statistics is paired with computation, you'll see how reproducible research in particular increases scientific knowledge.
### Reproducible research {-}
> The most important tool is the _mindset_, when starting, that the end product will be reproducible. – Keith Baggerly
\index{Baggerly, Keith}
Another goal of this book is to help readers understand the importance of reproducible analyses. The hope is to get readers into the habit of making their analyses reproducible from the very beginning. This means we'll be trying to help you build new habits. This will take practice and be difficult at times. You'll see just why it is so important for you to keep track of your code and document it well to help yourself later and any potential collaborators as well.
Copying and pasting results from one program into a word processor is not an ideal way to conduct efficient and effective scientific research. It's much more important for time to be spent on data collection and data analysis and not on copying and pasting plots back and forth across a variety of programs.
In traditional analyses, if an error was made with the original data, we'd need to step through the entire process again: recreate the plots and copy-and-paste all of the new plots and our statistical analysis into our document. This is error-prone and a frustrating use of time. We want to help you to get away from this tedious activity so that we can spend more time doing science.
> We are talking about _computational_ reproducibility. – Yihui Xie
\index{Xie, Yihui}
Reproducibility means different things in different scientific fields. Are experiments conducted in a way that another researcher could follow the steps and get similar results? In this book, we will focus on what is known as **computational reproducibility**. \index{computational reproducibility} This refers to being able to pass all of one's data analysis, datasets, and conclusions to someone else and have them get exactly the same results on their machine. This allows for time to be spent interpreting results and considering assumptions instead of the more error-prone way of starting from scratch or following a list of steps that may be different from machine to machine.
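As a small illustration of what computational reproducibility can look like in practice, here is a minimal sketch using the base R function `set.seed()`, which fixes the starting point of R's random number generator so that code involving random sampling returns the same values each time it is run on the same version of R. The seed value `76` and the object names are arbitrary choices made for this sketch and are not taken from the book's own analyses.

```r
# Fix the random number generator's state so that the "random" draw
# below can be repeated by anyone who runs this code.
set.seed(76)
first_run <- sample(1:100, size = 5)

# Resetting the same seed and repeating the call gives the same draw.
set.seed(76)
second_run <- sample(1:100, size = 5)

# TRUE when both runs produced exactly the same five numbers.
identical(first_run, second_run)
```

Without the calls to `set.seed()`, the two draws would most likely differ, and a collaborator rerunning the analysis could not check the reported numbers against their own.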
### Final note for students {-}
At this point, if you are interested in instructor perspectives on this book, ways to contribute and collaborate, or the technical details of this book's construction and publishing, then continue with the rest of the chapter. Otherwise, let's get started with R and RStudio in Chapter \@ref(getting-started)!
## Introduction for instructors {-}
### Resources {-}
Here are some resources to help you use *ModernDive*:
1. We've included review questions posed as *Learning checks*. You can find all the solutions to all *Learning checks* in Appendix D of the online version of the book at <https://moderndive.com/D-appendixD.html>.
1. Dr.\ Jenny Smetzer and Albert Y. Kim have written a series of labs and problem sets. You can find them at <https://moderndive.com/labs>.
1. You can see the webpages for two courses that use *ModernDive*:
    + Smith College "SDS192 Introduction to Data Science": <https://rudeboybert.github.io/SDS192/>.
    + Smith College "SDS220 Introduction to Probability and Statistics": <https://rudeboybert.github.io/SDS220/>.
### Why did we write this book? {-}
This book is inspired by
- *Mathematical Statistics with Resampling and R* [@hester2011]
- *OpenIntro: Intro Stat with Randomization and Simulation* [@isrs2014]
- *R for Data Science* [@rds2016]
The first book, designed for upper-level undergraduates and graduate students, provides an excellent resource on how to use resampling to impart statistical concepts like sampling distributions using computation instead of large-sample approximations and other mathematical formulas. The last two books are free options for learning about introductory statistics and data science, providing an alternative to the many traditionally expensive introductory statistics textbooks.
When looking over the introductory statistics textbooks that currently exist, we found there wasn't one that incorporated many newly developed R packages directly into the text, in particular the many packages included in the [`tidyverse`](http://tidyverse.org/) set of packages, such as `ggplot2`, `dplyr`, `tidyr`, and `readr` that will be the focus of this book's first part on "Data Science with `tidyverse`."
Additionally, there wasn't an open-source and easily reproducible textbook available that exposed new learners to all four of the learning goals we listed in the "Introduction for students" subsection. We wanted to write a book that could develop theory via computational techniques and help novices master the R language in doing so.
### Who is this book for? {-}
This book is intended for instructors of traditional introductory statistics classes using RStudio, who would like to inject more data science topics into their syllabus. RStudio can be used in either the server version or the desktop version. (This is discussed further in Subsection \@ref(installing).) We assume that students taking the class will have no prior algebra, calculus, or programming/coding experience.
Here are some principles and beliefs we kept in mind while writing this text. If you agree with them, this is the book for you.
1. **Blur the lines between lecture and lab**
    + With increased availability and accessibility of laptops and open-source non-proprietary statistical software, the strict dichotomy between lab and lecture can be loosened.
    + It's much harder for students to understand the importance of using software if they only use it once a week or less. They forget the syntax in much the same way someone learning a foreign language forgets the grammar rules. Frequent reinforcement is key.
1. **Focus on the entire data/science research pipeline**
    + We believe that the entirety of Grolemund and Wickham's [data/science pipeline](http://r4ds.had.co.nz/introduction.html) \index{data science pipeline} as seen in Figure \@ref(fig:pipeline-figure) should be taught.
    + We heed George Cobb's call to ["minimize prerequisites to research"](https://arxiv.org/abs/1507.05346): \index{Cobb, George} students should be answering questions with data as soon as possible.
1. **It's all about the data**
    + We leverage R packages for rich, real, and realistic datasets that at the same time are easy to load into R, such as the `nycflights13` and `fivethirtyeight` packages.
    + We believe that [data visualization is a "gateway drug" for statistics](http://escholarship.org/uc/item/84v3774z) and that the grammar of graphics as implemented in the `ggplot2` package is the best way to impart such lessons. However, we often hear: "You can't teach `ggplot2` for data visualization in intro stats!" We, like \index{Robinson, David} [David Robinson](http://varianceexplained.org/r/teach_ggplot2_to_beginners/), are much more optimistic and have found our students have been largely successful in learning it.
    + `dplyr` has made data wrangling much more [accessible](http://chance.amstat.org/2015/04/setting-the-stage/) to novices, and hence much more interesting datasets can be explored.
1. **Use simulation/resampling to introduce statistical inference, not probability/mathematical formulas**
    + Instead of using formulas, large-sample approximations, and probability tables, we teach statistical concepts using simulation-based inference.
    + This allows for a de-emphasis of traditional probability topics, freeing up room in the syllabus for other topics. Bridges to these mathematical concepts are given as well to help relate these traditional topics to more modern approaches.
1. **Don't fence off students from the computation pool, throw them in!**
    + Computing skills are essential to working with data in the 21st century. Given this fact, we feel that to shield students from computing is to ultimately do them a disservice.
    + We are not teaching a course on coding/programming per se, but rather just enough of the computational and algorithmic thinking necessary for data analysis.
1. **Complete reproducibility and customizability**
    + We are frustrated when textbooks give examples, but not the source code and the data itself. We give you the source code for all examples as well as the whole book! While we have made choices to occasionally hide the code that produces more complicated figures, reviewing the book's GitHub repository will provide you with all the code (see below).
    + Ultimately the best textbook is one you've written yourself. You know best your audience, their background, and their priorities. You know best your own style and the types of examples and problems you like best. Customization is the ultimate end. We encourage you to take what we've provided and make it work for your own needs. For more about how to make this book your own, see "About this book" later in this Preface.
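
To make the simulation/resampling principle in the list above a little more concrete, here is a minimal sketch in base R of a bootstrap percentile confidence interval for a mean. The sample values, the seed, and the number of resamples are all made up for illustration. The book itself carries out this kind of analysis with the `infer` package, so the hand-written resampling below is only a stand-in for that workflow and not the book's method.

```r
# Hypothetical sample of 10 observations (made-up values for illustration).
obs <- c(12, 15, 9, 20, 14, 11, 17, 13, 16, 10)

set.seed(2024)  # arbitrary seed so that the resampling is repeatable
n_reps <- 1000  # arbitrary number of bootstrap resamples

# Resample the data with replacement and record the mean of each resample.
boot_means <- replicate(
  n_reps,
  mean(sample(obs, size = length(obs), replace = TRUE))
)

# The middle 95% of the bootstrap means gives a percentile-based interval.
ci <- quantile(boot_means, probs = c(0.025, 0.975))
ci
```

No probability table or large-sample formula appears anywhere in the sketch: the interval comes entirely from the spread of the resampled means, which is the idea students are asked to build intuition for.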
## Connect and contribute {-}
If you would like to connect with ModernDive, check out the following links:
* If you would like to receive periodic updates about ModernDive (roughly every 6 months), please sign up for our [mailing list](http://eepurl.com/cBkItf).
* Contact Albert at `[email protected]` and Chester at `[email protected]`.
* We're on Twitter at <https://twitter.com/ModernDive>.
If you would like to contribute to *ModernDive*, there are many ways! We would love your help and feedback to make this book as great as possible! For example, if you find any errors, typos, or areas for improvement, then please email us or post an issue on our [GitHub issues](https://github.com/moderndive/moderndive_book/issues) \index{GitHub issues} page. If you are familiar with GitHub and would like to contribute, see the "About this book" section.
## Acknowledgements {-}
The authors would like to thank [Nina Sonneborn](https://github.com/nsonneborn), [Dr.\ Alison Hill](https://alison.rbind.io/), [Kristin Bott](https://twitter.com/rhobott?lang=en), Dr.\ Jenny Smetzer, [Prof. Katherine Kinnaird](https://www.smith.edu/academics/faculty/katherine-kinnaird), and the participants of our [2017](https://www.causeweb.org/cause/uscots/uscots17/workshop/3) and [2019](https://www.causeweb.org/cause/uscots/uscots19/workshop/4) USCOTS workshops for their feedback and suggestions. We'd also like to thank [Dr.\ Andrew Heiss](https://twitter.com/andrewheiss) for contributing nearly all of Subsection \@ref(tips-code) on "Errors, warnings, and messages," [Evgeni Chasnovski](https://github.com/echasnovski) for creating the new `geom_parallel_slopes()` extension to the `ggplot2` package for plotting parallel slopes models, and Smith College Statistical & Data Sciences students [Starry Zhou](https://github.com/Starryz) and [Marium Tapal](https://github.com/mariumtapal) for their many edits to the book. A special thanks goes to Dr.\ Jude Weinstein-Jones, co-founder of [The Learning Scientists](https://www.learningscientists.org), for their extensive feedback.
We were both honored to have [Dr.\ Kelly S. McConville](https://mcconville.rbind.io/) write the [Foreword](#foreword) of the book. Dr.\ McConville is a pioneer in statistics education and was a source of great inspiration to both of us as we continued to update the book to get it to its current form. Thanks additionally to the [continued contributions by members of the community](https://github.com/moderndive/ModernDive_book/graphs/contributors) to the book on GitHub and to the many individuals who have recommended this book to others. We are so very appreciative of all of you!
Lastly, a very special shout out to any student who has ever taken a class with us at Pacific University, Reed College, Middlebury College, Amherst College, or Smith College. We couldn't have made this book without you!
## About this book {-}
This book was written using RStudio's [bookdown](https://bookdown.org/) \index{bookdown} package by Yihui Xie \index{Xie, Yihui} [@R-bookdown]. This package simplifies the publishing of books by having all content written in \index{R Markdown} [R Markdown](http://rmarkdown.rstudio.com/html_document_format.html). The bookdown/R Markdown source code for all versions of ModernDive is available on GitHub:
* **Latest online version** The most up-to-date release:
    + Version `r latest_release_version` released on `r latest_release_date` ([source code](https://github.com/moderndive/moderndive_book/releases/tag/v`r latest_release_version`))
    + Available at <https://moderndive.com/>
* **Print edition** The CRC Press [print edition](https://www.routledge.com/Statistical-Inference-via-Data-Science-A-ModernDive-into-R-and-the-Tidyverse/Ismay-Kim/p/book/9780367409821) of *ModernDive* corresponds to Version 1.1.0 (with some typos fixed).
* **Previous online versions** Older versions that may be out of date:
    + [Version 1.0.0](previous_versions/v1.0.0/index.html) released on November 25, 2019 ([source code](https://github.com/moderndive/ModernDive_book/releases/tag/v1.0.0))
    + [Version 0.6.1](previous_versions/v0.6.1/index.html) released on August 28, 2019 ([source code](https://github.com/moderndive/ModernDive_book/releases/tag/v0.6.1))
    + [Version 0.6.0](previous_versions/v0.6.0/index.html) released on August 7, 2019 ([source code](https://github.com/moderndive/moderndive_book/releases/tag/v0.6.0))
    + [Version 0.5.0](previous_versions/v0.5.0/index.html) released on February 24, 2019 ([source code](https://github.com/moderndive/moderndive_book/releases/tag/v0.5.0))
    + [Version 0.4.0](previous_versions/v0.4.0/index.html) released on July 21, 2018 ([source code](https://github.com/moderndive/moderndive_book/releases/tag/v0.4.0))
    + [Version 0.3.0](previous_versions/v0.3.0/index.html) released on February 3, 2018 ([source code](https://github.com/moderndive/moderndive_book/releases/tag/v0.3.0))
    + [Version 0.2.0](previous_versions/v0.2.0/index.html) released on August 2, 2017 ([source code](https://github.com/moderndive/moderndive_book/releases/tag/v0.2.0))
    + [Version 0.1.3](previous_versions/v0.1.3/index.html) released on February 9, 2017 ([source code](https://github.com/moderndive/moderndive_book/releases/tag/v0.1.3))
    + [Version 0.1.2](previous_versions/v0.1.2/index.html) released on January 22, 2017 ([source code](https://github.com/moderndive/moderndive_book/releases/tag/v0.1.2))
Could this be a new paradigm for textbooks? Instead of the traditional model of textbook companies publishing updated *editions* of the textbook every few years, we apply a software-design-influenced model of publishing more easily updated *versions*. We can then leverage open-source communities of instructors and developers for ideas, tools, resources, and feedback. As such, we welcome your GitHub pull requests.
Finally, since this book is under a [Creative Commons Attribution - NonCommercial - ShareAlike 4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/), feel free to modify the book as you wish for your own non-commercial needs, but please list the authors at the top of `index.Rmd` as: "Chester Ismay, Albert Y. Kim, and YOU!"
# About the authors {-}
Chester Ismay | Albert Y. Kim
:-------------------------:|:-------------------------:
`r include_image(path = "images/ismay.png", html_opts = "height=220px", latex_opts = "width=46%")` | `r include_image(path = "images/kim.png", html_opts = "height=220px", latex_opts = "width=46%")`
**Chester Ismay** is Vice President of Data and Automation at MATE Seminars and is a freelance data science consultant and instructor. He also teaches in the Center for Executive and Professional Education at Portland State University. He completed his PhD in statistics at Arizona State University in 2013. He has previously worked in a variety of roles including as an actuary at Scottsdale Insurance Company (now Nationwide E&S/Specialty) and at Ripon College, Reed College, and Pacific University. He has experience working in online education and was previously a Data Science Evangelist at DataRobot, where he led data science, machine learning, and data engineering in-person and virtual workshops for DataRobot University. In addition to his work for *ModernDive*, he also contributed as initial developer of the [`infer`](https://cran.r-project.org/package=infer) R package and is author and maintainer of the [`thesisdown`](https://github.com/ismayc/thesisdown) R package.
* Email: `[email protected]`
* Webpage: <https://chester.rbind.io/>
* Twitter: [old_man_chester](https://twitter.com/old_man_chester)
* GitHub: <https://github.com/ismayc>
**Albert Y.\ Kim** is an Assistant Professor of Statistical & Data Sciences at Smith College in Northampton, MA, USA. He completed his PhD in statistics at the University of Washington in 2011. Previously he worked in the Search Ads Metrics Team at Google Inc.\ as well as at Reed, Middlebury, and Amherst Colleges. In addition to his work for *ModernDive*, he is a co-author of the [`resampledata`](https://cran.r-project.org/package=resampledata) and [`SpatialEpi`](https://cran.r-project.org/package=SpatialEpi) R packages.
* Email: `[email protected]`
* Webpage: <https://rudeboybert.rbind.io/>
* Twitter: [rudeboybert](https://twitter.com/rudeboybert)
* GitHub: <https://github.com/rudeboybert>
Both Drs. Ismay and Kim, along with [Jennifer Chunn](https://github.com/jchunn), are co-authors of the [`fivethirtyeight`](https://fivethirtyeight-r.netlify.app/) package of code and datasets published by the data journalism website [FiveThirtyEight.com](https://fivethirtyeight.com/).
<!-- For use only in PDF, is skipped in HTML -->
\mainmatter