Taking control of qualitative colors in `ggplot2`
========================================================
```{r include = FALSE}
## I format my code intentionally!
## do not re-format it for me!
knitr::opts_chunk$set(tidy = FALSE)
## sometimes necessary until I can figure out why loaded packages are leaking
## from one file to another, e.g. from block91_latticeGraphics.rmd to this file
if(length(yo <- grep("gplots", search())) > 0) detach(pos = yo)
if(length(yo <- grep("gdata", search())) > 0) detach(pos = yo)
if(length(yo <- grep("gtools", search())) > 0) detach(pos = yo)
```
### Optional getting started advice
*Ignore if you don't need this bit of support.*
This is one in a series of tutorials in which we explore basic data import, exploration and much more using data from the [Gapminder project](http://www.gapminder.org). Now is the time to make sure you are working in the appropriate directory on your computer, perhaps through the use of an [RStudio project](block01_basicsWorkspaceWorkingDirProject.html). To ensure a clean slate, you may wish to clean out your workspace and restart R (both available from the RStudio Session menu, among other methods). Confirm that the new R process has the desired working directory, for example, with the `getwd()` command or by glancing at the top of RStudio's Console pane.
Open a new R script (in RStudio, File > New > R Script). Develop and run your code from there (recommended) or periodically copy "good" commands from the history. In due course, save this script with a name ending in .r or .R, containing no spaces or other funny stuff, and evoking "ggplot2" and "colors".
### Load the Gapminder data and `ggplot2`
Assuming the data can be found in the current working directory, this works:
```{r, eval=FALSE}
gDat <- read.delim("gapminderDataFiveYear.txt")
```
Plan B (which I use here, because of where the source of this tutorial lives):
```{r}
## data import from URL
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
```
Basic sanity check that the import has gone well:
```{r}
str(gDat)
```
Drop Oceania, which only has two countries:
```{r}
## drop Oceania
jDat <- droplevels(subset(gDat, continent != "Oceania"))
str(jDat)
```
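Why bother with `droplevels()`? A minimal sketch on a made-up data frame (`toy` is hypothetical, not the Gapminder data): `subset()` removes the rows, but the unused factor level lingers in `levels()` and `table()` until it is dropped explicitly.

```r
## hypothetical toy data, standing in for gDat
toy <- data.frame(continent = factor(c("Asia", "Europe", "Oceania", "Asia")),
                  pop = c(10, 20, 30, 40))

kept <- subset(toy, continent != "Oceania")
levels(kept$continent)      # "Oceania" is still a level, with zero rows
table(kept$continent)

tidy <- droplevels(kept)
levels(tidy$continent)      # only the levels that actually occur
```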
Load `ggplot2`:
```{r}
library(ggplot2)
```
### Take control of the size and color of points
Let's use `ggplot2` to move towards the classic Gapminder bubble chart. Crawl then walk then run.
First, make a simple scatterplot for a single year.
```{r}
jYear <- 2007
q <- ggplot(subset(jDat, year == jYear),
            aes(x = gdpPercap, y = lifeExp)) + scale_x_log10()
q + geom_point()
```
Take control of the plotting symbol, its size, and its color. Use obnoxious settings so that success versus failure is completely obvious. Now is not the time for the delicate operation of inserting your fancy color scheme. Be bold!
```{r}
## do I have control of size and fill color? YES!
q + geom_point(pch = 21, size = 8, fill = I("darkorchid1"))
```
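A small sketch of why plotting symbol 21 is the useful choice here, on hypothetical toy data: symbols 21 through 25 have a border (`colour`) and an interior (`fill`) that can be set separately, which is what lets `fill` carry a color scheme later. Values given outside `aes()` are constants applied to every point. I write `shape`, the `ggplot2` name for what base R calls `pch`, and use `layer_data()` to inspect what will be drawn.

```r
library(ggplot2)

## hypothetical toy data
toy <- data.frame(x = c(1, 10, 100), y = c(50, 60, 70))

p <- ggplot(toy, aes(x = x, y = y)) + scale_x_log10() +
  geom_point(shape = 21, size = 8, colour = "black", fill = "darkorchid1")
p

## inspect the computed layer: one constant fill shared by all points
built <- layer_data(p, 1)
unique(built$fill)
```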
### Circle area = population
We want the area of each circle to reflect population. Since we have direct control of the radius, we invert the relation $area = \pi r^2$ to determine the point size from the country's population. I have two complaints about my first attempt: the circles are still too small for my taste and I don't want the size legend. So in my second attempt, I suppress the legend with `show.legend = FALSE` (called `show_guide` in older versions of `ggplot2`) and I increase the range of sizes by explicitly setting the range for the scale that maps $\sqrt{pop / \pi}$ into circle size.
```{r fig.show='hold', out.width='50%'}
q + geom_point(aes(size = sqrt(pop/pi)), pch = 21)
(r <- q +
   geom_point(aes(size = sqrt(pop/pi)), pch = 21, show.legend = FALSE) +
   scale_size_continuous(range=c(1,40)))
```
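The inversion of $area = \pi r^2$ can be checked numerically. A minimal sketch with made-up populations (the `pop` values are hypothetical): taking $r = \sqrt{pop / \pi}$ gives circles whose areas equal the populations, so quadrupling the population doubles the radius.

```r
## hypothetical populations, in people
pop <- c(1e6, 4e6, 1.6e7)

r <- sqrt(pop / pi)   # radius that gives a circle of area pop
area <- pi * r^2      # recover the area from the radius

all.equal(area, pop)  # areas match populations
r / r[1]              # radius relative to the smallest country
```

One caveat, as far as I understand it: since `ggplot2` 2.0.0, `scale_size_continuous()` itself scales by area and `scale_radius()` scales by radius, so with a current `ggplot2` it is worth checking which convention your sizes actually follow.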
### Circle fill color determined by a factor
Now I use `aes()` to map a factor to fill color. For the moment, I settle for the `continent` factor and for the automatic color scheme. I also facet by continent. Why? Because it will be helpful below for checking my progress on using my custom color scheme. In that scheme all the countries in, say, Europe are some shade of green, so if a continent's facet shows circles of many unrelated colors, I'll know something's wrong.
```{r fig.show='hold', out.width='50%'}
(r <- r + facet_wrap(~ continent))
r + aes(fill = continent)
```
### Get the color scheme for the countries
Elsewhere, I devised a color scheme for the Gapminder countries. We will not discuss its construction here, but will merely pull it off the web. You can view it in PDF form [here](http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/figs/bryan-a01-colorScheme.pdf).
```{r}
## get the country color scheme
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderCountryColors.txt"
countryColors <- read.delim(file = gdURL, as.is = 3) # protect color
str(countryColors)
head(countryColors)
```
The data.frame `countryColors` has one row per country and three variables: country, continent, and color. The color variable holds the RGB hex strings encoding the color scheme.
__Note:__ The row order of `countryColors` is not alphabetical. The countries are actually sorted by size (in which particular year, I don't recall) within continent, reflecting the logic by which the scheme was created. No problem. Ideally, nothing in your analysis should depend on row order, although that's not always possible in reality.
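Because nothing should hinge on row order, lookups into a table like `countryColors` are best done by name. A sketch on a hypothetical three-country table (`toyColors`, `toyCountries`, and the hex strings are made up for illustration) of two checks worth running before plotting: that no country appears twice, and that every country in the data has a color.

```r
## hypothetical stand-in for countryColors, deliberately not alphabetical
toyColors <- data.frame(country = c("Nigeria", "Egypt", "Algeria"),
                        color = c("#7F3B08", "#B35806", "#E08214"),
                        stringsAsFactors = FALSE)

## hypothetical stand-in for the countries present in the data
toyCountries <- c("Algeria", "Egypt", "Nigeria")

anyDuplicated(toyColors$country) == 0      # is each country listed once?
all(toyCountries %in% toyColors$country)   # does every country have a color?

## look up by name, not by position
toyColors$color[match(toyCountries, toyColors$country)]
```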
### Prepare the color scheme for use with `ggplot2`
In the Grammar of Graphics, a __scale__ controls the mapping from a variable in the data to an aesthetic. So far we've let the coloring / filling scale be determined automatically by `ggplot2`. But to use our custom color scheme, we need to take control of the mapping of the `country` factor into fill color in `geom_point()`.
We will use `scale_fill_manual`, a member of a family of functions for customization of the discrete scales. The main argument is `values =`, which is a vector of aesthetic values -- fill colors, in our case. If this vector has names, they will be consulted during the mapping. This is incredibly useful! Below, we isolate the vector of hex strings providing the country colors and give this vector the country names as names. This saves us from any worry about the order of levels of the `country` factor, the row order of the data, or exactly which countries are being plotted.
```{r}
jColors <- countryColors$color
names(jColors) <- countryColors$country
head(jColors)
```
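To see the name matching at work, here is a minimal sketch on hypothetical toy data (`toy` and `toyFill` are made up). The color vector is deliberately given in a different order from the factor levels, yet each group should get the color stored under its own name.

```r
library(ggplot2)

## hypothetical toy data: one point per group
toy <- data.frame(x = 1:3, y = 1:3, grp = factor(c("a", "b", "c")))

## names, not positions, decide which color goes with which level
toyFill <- c(c = "#0000FF", a = "#FF0000", b = "#00FF00")

p <- ggplot(toy, aes(x = x, y = y, fill = grp)) +
  geom_point(shape = 21, size = 8) +
  scale_fill_manual(values = toyFill)
p

## fill actually assigned to each row of toy
fills <- layer_data(p, 1)$fill
data.frame(grp = toy$grp, fill = fills)
```

Had `toyFill` been unnamed, the colors would instead have been handed out in the order of the factor levels.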
### Make the `ggplot2` bubble chart
This is deceptively simple at this point. Like many things, it looks really easy once we've figured everything out! The last two bits we add are `aes()`, to specify that the country should be mapped to fill color, and `scale_fill_manual()`, to specify our custom color scheme.
```{r}
r + aes(fill = country) + scale_fill_manual(values = jColors)
```
### Epilogue: re-make the plot to reveal small countries
We know from earlier work with `lattice` that large countries can effectively hide the data from small countries, by covering them up. This is a case where, sadly, the row order of the data truly affects the visual output. `ggplot2` is no less vulnerable to this than `lattice` or base graphics here. So, to get closure, we sort the data on year and then on decreasing population, so that within a year the largest countries are drawn first and the smaller ones land on top. Then we remake the plot, revealing all the code.
```{r}
jDat <- jDat[with(jDat, order(year, -1 * pop)), ]
ggplot(subset(jDat, year == jYear),
       aes(x = gdpPercap, y = lifeExp)) + scale_x_log10() +
  geom_point(aes(size = sqrt(pop/pi)), pch = 21, show.legend = FALSE) +
  scale_size_continuous(range=c(1,40)) +
  facet_wrap(~ continent) +
  aes(fill = country) + scale_fill_manual(values = jColors)
```
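The effect of the sort is easy to see on a toy data frame (hypothetical values): within each year the most populous rows come first, so they are drawn first and the smaller circles are drawn over them.

```r
## hypothetical toy data
toy <- data.frame(country = c("small", "huge", "medium", "huge"),
                  year = c(2007, 2007, 2007, 2002),
                  pop = c(1e5, 1e9, 1e7, 9e8))

## same idiom as above: year ascending, population descending
sorted <- toy[with(toy, order(year, -1 * pop)), ]
sorted
```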
### References
* Lattice: Multivariate Data Visualization with R by Deepayan Sarkar, Springer (2008) | [available via SpringerLink](http://ezproxy.library.ubc.ca/login?url=http://link.springer.com.ezproxy.library.ubc.ca/book/10.1007/978-0-387-75969-2/page/1) | [all code from the book](http://lmdvr.r-forge.r-project.org/) | [GoogleBooks search](http://books.google.com/books?id=gXxKFWkE9h0C&lpg=PR2&dq=lattice%20sarkar%23v%3Donepage&pg=PR2#v=onepage&q=&f=false)
    - especially Chapter 6 Scales, axes, and legends
<div class="footer">
This work is licensed under the <a href="http://creativecommons.org/licenses/by-nc/3.0/">CC BY-NC 3.0 Creative Commons License</a>.
</div>