forked from jennybc/STAT545A_2013
-
Notifications
You must be signed in to change notification settings - Fork 0
/
block09_xyplotLattice.rmd
210 lines (167 loc) · 10.1 KB
/
block09_xyplotLattice.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
Scatterplots with `lattice`
========================================================
```{r include = FALSE}
## I format my code intentionally!
## do not re-format it for me!
opts_chunk$set(tidy = FALSE)
## sometimes necessary until I can figure out why loaded packages are leaking
## from one file to another, e.g. from block91_latticeGraphics.rmd to this file
if(length(yo <- grep("gplots", search())) > 0) detach(pos = yo)
if(length(yo <- grep("gdata", search())) > 0) detach(pos = yo)
if(length(yo <- grep("gtools", search())) > 0) detach(pos = yo)
```
We focus on studying the relationship between two quantitative variables -- possibly in conjunction with one or more categorical variables. We use the `xyplot()` function from the `lattice` package in this tutorial and will revisit this ground using `ggplot2` shortly.
### Optional getting started advice
*Ignore if you don't need this bit of support.*
This is one in a series of tutorials in which we explore basic data import, exploration and much more using data from the [Gapminder project](http://www.gapminder.org). Now is the time to make sure you are working in the appropriate directory on your computer, perhaps through the use of an [RStudio project](block01_basicsWorkspaceWorkingDirProject.html). To ensure a clean slate, you may wish to clean out your workspace and restart R (both available from the RStudio Session menu, among other methods). Confirm that the new R process has the desired working directory, for example, with the `getwd()` command or by glancing at the top of RStudio's Console pane.
Open a new R script (in RStudio, File > New > R Script). Develop and run your code from there (recommended) or periodicially copy "good" commands from the history. In due course, save this script with a name ending in .r or .R, containing no spaces or other funny stuff, and evoking "scatter plots" and "lattice".
### Load the Gapminder data, drop Oceania, load packages
Assuming the data can be found in the current working directory, this works:
```{r eval=FALSE}
gDat <- read.delim("gapminderDataFiveYear.txt")
```
Plan B (I use here, because of where the source of this tutorial lives):
```{r}
## data import from URL
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
```
Basic sanity check that the import has gone well:
```{r}
str(gDat)
```
Drop Oceania, which only has two countries
```{r}
## drop Oceania
jDat <- droplevels(subset(gDat, continent != "Oceania"))
str(jDat)
```
Load the `lattice` package.
```{r}
library(lattice)
```
### Show me the data with `xyplot()`
Get a scatterplot with `xyplot()`. Background grids are nice. In addition to `grid = TRUE`, `grid = "h"` and `grid = "v"` are other useful shortcuts.
```{r fig.show = 'hold', out.width = '49%'}
xyplot(lifeExp ~ gdpPercap, jDat)
xyplot(lifeExp ~ gdpPercap, jDat,
grid = TRUE)
```
Clearly we need to log transform the horizontal axis, i.e. the one for GDP per capita. You can take the logarithm directly in the formula, but it's sub-optimal due to the axis tick labels, which are not so easy to read. It's better to request the log transform via the `scales =` argument.
```{r fig.show='hold', out.width='50%'}
## log, the sub-optimal way
xyplot(lifeExp ~ log10(gdpPercap), jDat,
grid = TRUE)
## logging, better way ... step 1
xyplot(lifeExp ~ gdpPercap, jDat,
grid = TRUE,
scales = list(x = list(log = 10)))
```
There are many other log axis labelling strategies. The next easiest thing is to specify `equispaced.log = FALSE` Even more is possible with functions in the `latticeExtra` package; see some examples in these [slides](http://www.stat.ubc.ca/~jenny/STAT545A/2012-lectures/cm08.pdf). I wish the tick marks and grid lined up, but I'm trying not to let the perfect be the enemy of the good. Let's press on.
```{r}
xyplot(lifeExp ~ gdpPercap, jDat,
grid = TRUE,
scales = list(x = list(log = 10, equispaced.log = FALSE)))
```
### `type =`
The `type =` argument can be used to enhance the figure with data-responsive elements. On the left, we specify the default value `type = "p"`, which requests only points, and on the right, we request a simple linear regression with `type = c("p", "r")`.
```{r fig.show='hold', out.width='50%'}
xyplot(lifeExp ~ gdpPercap, jDat,
grid = TRUE,
scales = list(x = list(log = 10, equispaced.log = FALSE)),
type = "p")
xyplot(lifeExp ~ gdpPercap, jDat,
grid = TRUE,
scales = list(x = list(log = 10, equispaced.log = FALSE)),
type = c("p", "r"))
```
Let's make line that more obnoxious and more orange. And let's try a different overlay -- namely, a smooth regression via `type = c("p", "smooth")`. In this case the difference is pretty subtle.
```{r fig.show='hold', out.width='50%'}
xyplot(lifeExp ~ gdpPercap, jDat,
grid = TRUE,
scales = list(x = list(log = 10, equispaced.log = FALSE)),
type = c("p", "r"), col.line = "darkorange", lwd = 3)
xyplot(lifeExp ~ gdpPercap, jDat,
grid = TRUE,
scales = list(x = list(log = 10, equispaced.log = FALSE)),
type = c("p", "smooth"), col.line = "darkorange", lwd = 3)
```
### `group =`
If you specify a grouping variable via `group =`, the data from the various groups will be superposed and visually distinguished. A simple key is obtained with `auto.key = TRUE`, but there are many alternative ways to customize the key. We won't fuss with that on this quick tour.
```{r fig.show='hold', out.width='50%'}
xyplot(lifeExp ~ gdpPercap, jDat,
grid = TRUE,
scales = list(x = list(log = 10, equispaced.log = FALSE)),
group = continent)
## auto.key
xyplot(lifeExp ~ gdpPercap, jDat,
grid = TRUE,
scales = list(x = list(log = 10, equispaced.log = FALSE)),
group = continent, auto.key = TRUE)
```
The `group =` argument can be used together with the `type =` argument.
```{r fig.show='hold', out.width='50%'}
## groups + type "smooth"
xyplot(lifeExp ~ gdpPercap, jDat,
grid = TRUE,
scales = list(x = list(log = 10, equispaced.log = FALSE)),
group = continent, auto.key = TRUE,
type = c("p", "smooth"), lwd = 4)
## making key more compact
xyplot(lifeExp ~ gdpPercap, jDat,
grid = TRUE,
scales = list(x = list(log = 10, equispaced.log = FALSE)),
group = continent, auto.key = list(columns = nlevels(jDat$continent)),
type = c("p", "smooth"), lwd = 4)
```
### Multi-panel conditioning
If you would rather see data for different groups in separate *panels*, instead of superposed, specify the grouping factor `z` like so: `y ~ x | z`. This is called multi-panel conditioning. Sometimes I specify the same factor as the conditioning and grouping variable, just because I like to keep the color scheme constant and visible.
```{r fig.show='hold', out.width='50%'}
xyplot(lifeExp ~ gdpPercap | continent, jDat,
grid = TRUE,
scales = list(x = list(log = 10, equispaced.log = FALSE)))
xyplot(lifeExp ~ gdpPercap | continent, jDat,
group = continent,
grid = TRUE,
scales = list(x = list(log = 10, equispaced.log = FALSE)))
```
You can use conditioning, grouping, and `type =` in various combinations.
```{r fig.show='hold', out.width='50%'}
## conditioning + type "r" or "smooth"
xyplot(lifeExp ~ gdpPercap | continent, jDat,
grid = TRUE,
scales = list(x = list(log = 10, equispaced.log = FALSE)),
type = c("p", "smooth"), col.line = "darkorange", lwd = 4)
xyplot(lifeExp ~ gdpPercap | continent, jDat,
grid = TRUE, group = continent,
scales = list(x = list(log = 10, equispaced.log = FALSE)),
type = c("p", "smooth"), lwd = 4)
```
### Combat overplotting
In a univariate setting, we fought overplotting with jitter, before we turned to entirely different strategies, such as the kernel density estimation that underpins `densityplot()`. In scatterplots, the entry level solution to overplotting is to use alpha transparency for the points. I do that on the left via `alpha = 1/2`. Here's how to think of this: if alpha transparency is set to $1/k$ then an area becomes opaque when $k$ points overlap. (My `lattice` default specifies `pch = 16`, if you want to specify that in your call. Warning: not all graphics devices support alpha transparency.)
```{r fig.show='hold', out.width='50%'}
xyplot(lifeExp ~ gdpPercap | continent, jDat,
grid = TRUE,
scales = list(x = list(log = 10, equispaced.log = FALSE)),
type = c("p", "smooth"), lwd = 4, alpha = 1/2)
xyplot(lifeExp ~ gdpPercap | continent, jDat,
grid = TRUE,
scales = list(x = list(log = 10, equispaced.log = FALSE)),
panel = panel.smoothScatter)
```
On the right I actually request a non-default panel function, `panel.smoothScatter()` that is, in fact, based on 2D kernel density estimation. There's more about panel functions [here](block10_latticeNittyGritty.html).
Finally, hexagonal binning is an interesting way to handle overplotting. The plane is divided into a hexagonal lattice and the individual hexagons are shaded according to the number of point that fall within. This requires an add-on library `hexbin` which you will need to install. Then you can use `hexbinplo()` as almost a drop-in substitute for `xyplot()`.
```{r}
#install.packages("hexbin", dependencies = TRUE)
library(hexbin)
hexbinplot(lifeExp ~ gdpPercap, jDat,
scales = list(x = list(log = 10, equispaced.log = FALSE)),
aspect = 1, bins=50)
```
And thus endeth the quick tour of `xyplot().`
### References
Lattice: Multivariate Data Visualization with R [available via SpringerLink](http://ezproxy.library.ubc.ca/login?url=http://link.springer.com.ezproxy.library.ubc.ca/book/10.1007/978-0-387-75969-2/page/1) by Deepayan Sarkar, Springer (2008) | [all code from the book](http://lmdvr.r-forge.r-project.org/) | [GoogleBooks search](http://books.google.com/books?id=gXxKFWkE9h0C&lpg=PR2&dq=lattice%20sarkar%23v%3Donepage&pg=PR2#v=onepage&q=&f=false)
* Ch. 5 Scatter Plots and Extensions is most relevant to this tutorial
<div class="footer">
This work is licensed under the <a href="http://creativecommons.org/licenses/by-nc/3.0/">CC BY-NC 3.0 Creative Commons License</a>.
</div>