-
Notifications
You must be signed in to change notification settings - Fork 51
/
hw03_dataAggregation.rmd
362 lines (290 loc) · 23.5 KB
/
hw03_dataAggregation.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
Go back to [STAT545A home](current.html)
Homework #3 Data aggregation
========================================================
> Follow the [existing homework submission instructions](hw00_instructions.html) UP TO THE LINK SUBMISSION PART. Submit your links by editing this Google doc -- WARNING: I turned off sharing ~3pm Mon Sept 23 because almost everyone had submitted their links. Contact JB if you need to submit or edit a link.
* Start with the Gapminder data as provided in [`gapminderDataFiveYear.txt`](http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt).
* Use the `plyr` package to tackle __some, not all,__ tasks from the list below. __You decide what to work on__ but spend ~2 hours on this. Of course you are welcome to do more if you wish.
* Write up with R Markdown.
- Include a narrative, written in English prose, i.e. don't just show code and results
- Expose your code, i.e. use `echo = FALSE` very sparingly
- If you tackle a tricky one, first give a clean account of your data aggregation results. You can describe your pain and suffering at the end.
- Don't use figures! It's an artificial constraint, I know. Why hold back? Focus on data aggregation, we'll talk figures next week and revisit in the next homework. Plus you will get a visceral understanding of how ridiculous it is to do exploration *without* making lots of figures.
* DUE: Before class begins 9:30am Monday September 23.
- Compile into an HTML report. Follow the naming convention. Publish the HTML report on the web somewhere, such as on RPubs. Make the slug follow the naming convention.
- Publish the R Markdown file as a Gist. Follow the naming convention.
- Share your links by editing the Google doc mentioned above.
Tips
* Perform a superficial check that data import went OK.
* Give informative, short variable names.
* Attend to row and column order in reported results.
* Don't be satisfied merely with correct results; examine your code and see if it can be simpler, more readable, more general.
### Menu of data aggregation tasks.
```{r include = FALSE}
## I format my code intentionally!
## do not re-format it for me!
opts_chunk$set(tidy = FALSE)
## toggle to turn solution chunks on/off
## Assign HW: This file would live in private/ and not be in public repo and
## would not compiled by fileherd. To make student-facing assignment, uncomment
## the line below, knit this file, and copy the resulting .md file up one level,
## where fileherd will pick it up and compile to HTML and we can push to web.
## Only the data aggregation task prompts will appear.
## Reveal solutions: Move this file up into main directory. Leave the line below
## commented out. fileherd will compile questions AND solutions normally.
## Reset course for future: Hmmmm.... now the solutions are "out there on the
## web". I probably should have thought of that sooner. Worry about that later.
#opts_chunk$set(eval = FALSE, echo = FALSE)
```
#### Load the Gapminder data and the `plyr` and `xtable` packages
```{r}
## data import from URL
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
library(plyr)
library(xtable)
str(gDat)
## define a function for converting and printing to HTML table
htmlPrint <- function(x, ...,
digits = 0, include.rownames = FALSE) {
print(xtable(x, digits = digits, ...), type = 'html',
include.rownames = include.rownames, ...)
}
```
> I wrote a function `htmlPrint()` to create an HTML table and print it.
#### Create your own adventure.
You are welcome to create your own exercises by riffing on what you see below. In fact, that's great -- you won't be copying from your office mate and it emulates real life where you first need to think up interesting questions, then solve them. The easiest sort of variation is to do what I ask below but with a different variable, different summary statistic, different tuning parameter, etc.
#### Get the maximum and minimum of GDP per capita for all continents in a "wide" format. Easy.
Produce a data.frame with 3 variables: a continent factor, the minimum GDP per capita, the maximum GDP per capita. There will be one row per continent. Sort on either the minimum or the maximum and assess whether the continents rank similarly w/r/t the other extreme. Discuss.
```{r results = 'asis'}
foo <- ddply(gDat, ~ continent, summarize,
minGdpPercap = min(gdpPercap), maxGdpPercap = max(gdpPercap))
htmlPrint(arrange(foo, minGdpPercap))
```
#### Get the maximum and minimum of GDP per capita for all continents in a "tall" format. Medium/hard.
Produce a data.frame with 3 variables: a continent factor, a GDP per capita summary statistic, a factor that describes the statistic (i.e. "min" vs "max"). There will be one row per continent * statistic combination. *Don't use reshaping -- get this directly with data aggregation.*
```{r results = 'asis'}
foo <- ddply(gDat, ~ continent, function(x) {
gdpPercap <- range(x$gdpPercap)
return(data.frame(gdpPercap, stat = c("min", "max")))
})
htmlPrint(foo)
```
#### Look at the spread of GDP per capita within the continents. Easy.
Measures of spread (we will discuss in class soon, just try them out!):
* standard deviation or variance -- see `sd()` and `var()`
* median absolute deviation -- see `mad()`
* interquartile range -- see `IQR()`
Produce a data.frame with one row per continent and variables containing the different measures of spread for GDP/capita. Sort on one of those and assess whether the continents rank similarly w/r/t heterogeneity of GDP/capita by the other measures. Discuss.
```{r results = 'asis'}
foo <- ddply(gDat, ~ continent, summarize,
sdGdpPercap = sd(gdpPercap), madGdpPercap = mad(gdpPercap),
iqrGdpPercap = IQR(gdpPercap))
htmlPrint(arrange(foo, sdGdpPercap))
## spot checks!
IQR(gDat$gdpPercap[gDat$continent == "Europe"])
mad(gDat$gdpPercap[gDat$continent == "Asia"])
```
> If you don't get an error but you are still worried you aren't doing the right computation, pick a few cases in the middle at random and check "by hand".
#### Compute a trimmed mean of life expectancy for different years. Easy/medium.
Compute the average life expectancy by year, as a warm-up exercise. Then consider an alternative: a [*trimmed mean*](http://en.wikipedia.org/wiki/Truncated_mean). There is a `trim` argument in the built-in function `mean()`. Pick a modest trim fraction that is greater than 0. Report your sorted results. Make sure your trim level is well-documented.
```{r results = 'asis'}
jTrim <- 0.2
foo <- ddply(gDat, ~ year, summarize, tMean = mean(lifeExp, trim = jTrim))
htmlPrint(arrange(foo, tMean))
```
Hear ye, hear ye: the trim level for the trimmed means above is `r jTrim`. Let it be known that I did not type that number in; rather it was printed from the value assigned in the previous R chunk.
> If you haven't used it yet, you should experiment with the backtick method of including inline R code. Look at [the R Markdown file](https://github.com/jennybc/STAT545A/blob/master/hw03_dataAggregation.rmd) to see how I do this. If I change `jTrim` to a different value in the R code, my prose will automatically update.
<!--- TO DO: make the variable name reflect trim level, i.e. tMean0.2 instead of tMean. --->
#### How is life expectancy changing over time on different continents? "Tall" format. Easy/medium.
Get a data.frame with 3 variables: continent, year, and mean or median life expectancy. What are the trends over time? Are they the same or different across the continents? Is this easy to assess from this data.frame?
```{r results = 'asis'}
htmlPrint(ddply(gDat, ~ continent + year,
summarize, medLifeExp = median(lifeExp)))
```
#### How is life expectancy changing over time on different continents? "Wide" format. Hard (or possibly silly).
Get a data.frame with 1 + 5 = 6 variables: year and then one variable per continent, giving mean or median life expectancy. Is there a natural way to do this in `plyr` without hard-wiring the continent-specific variables and without reshaping an intermediate result? Since I am new-ish to `plyr`, I don't know yet if this is possible and/or a totally stupid way to go about this. So this one is for people who are happy to play around a bit. *Hint: I've made some progress here and advise you look into `daply()`.*
```{r results = 'asis'}
foo <- daply(gDat, ~ year + continent,
summarize, medLifeExp = median(lifeExp))
#str(foo)
foo <- as.data.frame(foo)
#str(foo) ## there's still some goofiness, ie the variables are lists .... hmmmm
htmlPrint(foo, include.rownames = TRUE)
```
> Arguably, this table is better transposed. So just switch the order of `continent` and `year` in the `plyr` call.
```{r results = 'asis'}
foo <- daply(gDat, ~ continent + year,
summarize, medLifeExp = median(lifeExp))
htmlPrint(as.data.frame(foo), include.rownames = TRUE)
```
#### Count the number of countries with low life expectancy over time by continent. "Tall" format. Medium/hard.
Compute some measure of worldwide life expectancy -- you decide -- a mean or median or some other quantile or perhaps your current age. The determine how many countries on each continent have a life expectancy less than this benchmark, for each year. Produce a data.frame with 3 variables: continent, year, and a country count. What's the trend over time? Is that trend the same for all continents? How much fun is it to assess this from this tall table?
```{r results = 'asis'}
#(bMark <- quantile(gDat$lifeExp, 0.2))
bMark <- 43 # JB wrote this on her 43rd b'day!
htmlPrint(ddply(gDat, ~ continent + year,
function(x) c(lowLifeExp = sum(x$lifeExp <= bMark))))
```
#### Compute the proportion of countries with low life expectancy over time by continent. Medium extension of another task.
The continents have very different numbers of countries. Maybe the proportion of "low life expectancy" countries is more relevant than the raw count. Or maybe the final table should report both? Work on that.
```{r results = 'asis'}
## note: bMark set in previous chunk
htmlPrint(ddply(gDat, ~ continent + year, function(x) {
jCount = sum(x$lifeExp <= bMark)
c(count = jCount, prop = jCount / nrow(x))
}), digits = c(0, 0, 0, 0, 2))
```
#### Report the absolute and/or relative abundance of countries with low life expectancy over time by continent. "Wide" format. Extension of another task of unknown difficulty / value.
Create a table with one row per year and one column -- at least conceptually -- per continent. In that "column" report the absolute and/or relative abundance of "low life expectancy" countries. Is there a natural way to do this in `plyr` without hard-wiring the continent-specific variables and without reshaping an intermediate result? Since I am new-ish to `plyr`, I don't know yet if this is possible and/or a totally stupid way to go about this. So this one is for people who are happy to play around a bit. *Hint: I've made some progress here and advise you look into `daply()`. I ended up skipping straight to character output versus an object appropriate for further computation or graphing. Sometimes it's better to approach these ill-defined tasks from a different angle.*
```{r results='asis'}
## note: bMark set in a previous chunk
## my first success ... but was still a 3-dim array :(
## that would be useful for plotting, at least after some reshaping, but is not
## so easy on the eyes
# daply(gDat, ~ year + continent, function(x) {
# jCount <- sum(x$lifeExp < bMark)
# c(count = jCount, prop = jCount / nrow(x))
# })
foo <- daply(gDat, ~ year + continent, function(x) {
jCount <- sum(x$lifeExp <= bMark)
jTotal <- nrow(x)
jProp <- jCount / jTotal
return(sprintf("%1.2f (%d/%d)", jProp, jCount, jTotal))
})
htmlPrint(foo, include.rownames = TRUE)
```
#### Find countries with interesting stories. Open-ended and, therefore, hard. Realistic!
It was easy (see above) to get the minimum life expectancy seen on each continent by year. Now try to report the year, the continent, the minimum (or max or whatever) life expectancy AND the country it pertains to. You could do something similar with GDP per capita or population (though result likely to be boring with population). *Hint: read up on `which.min()` and `which.max()`.*
```{r results = 'asis'}
## this works but produces a frustrating long/tall result; not good for printing
# ddply(gDat, ~ year + continent, function(x) {
# theMin <- which.min(x$lifeExp)
# x[theMin, c("country", "year", "continent", "lifeExp")]
# })
foo <- daply(gDat, ~ year + continent, function(x) {
theMin <- which.min(x$lifeExp)
return(sprintf("%1.0f %s", x$lifeExp[theMin], x$country[theMin]))
})
htmlPrint(foo, include.rownames = TRUE)
```
<!--- normal rules for padding in sprintf() achieved nothing; I guess due to what happens in an HTML table? as in, spaces mean nothing? is there any way to protect them, I wonder; if I pursue, here's the type of sprint() call I wanted: sprintf("%-20s - %1.0f", "canada", 22) --->
Consider the linear regression we fit in tutorial of life expectancy vs. time. Find the min (and/or max) of the slope (and/or the intercept) within each continent. Report these interesting countries and some info about them in a data.frame. This might work for other variables too.
```{r results = 'asis'}
yearMin <- min(gDat$year)
jFun <- function(x) {
estCoefs <- coef(lm(lifeExp ~ I(year - yearMin), x))
names(estCoefs) <- c("intercept", "slope")
return(estCoefs)
}
jCoefs <- ddply(gDat, ~ country + continent, jFun)
## giving out the Most Improved Award!
## going after maximum slope within continent
foo <- ddply(jCoefs, ~ continent, function(x) {
theMax <- which.max(x$slope)
x[theMax, c("continent", "country", "intercept", "slope")]
})
htmlPrint(foo, digits = c(0, 0, 0, 0, 2))
```
Sudden, substantial departures from the temporal trend is interesting. This goes for life expectancy, GDP per capita, or population. How might one operationalize this notion of "interesting"?
* Fit a regression of the response vs. time. Consider the residuals. Determine if there are 1 or more freakishly large residuals, in an absolute sense or relative to some estimate of background variability. If there are, consider that country "interesting".
* Fit a regression using ordinary least squares and a robust technique. Determine the difference in estimated parameters under the two approaches. If it is large, consider that country "interesting".
Make a data.frame of interesting countries, reporting their continent affiliation, and summary statistics on life expectancy (or GDP/capita or whatever you were looking at). Won't it be interesting to start to graphically explore them ... next week.
```{r results = 'asis'}
yearMin <- min(gDat$year)
jFun <- function(x) {
jFit <- lm(lifeExp ~ I(year - yearMin), x)
jCoef <- coef(jFit)
names(jCoef) <- NULL
return(c(intercept = jCoef[1],
slope = jCoef[2],
maxResid = max(abs(resid(jFit)))/summary(jFit)$sigma))
}
maxResids <- ddply(gDat, ~ country + continent, jFun)
foo <- ddply(maxResids, ~ continent, function(x) {
theMax <- which.max(x$maxResid)
x[theMax, ]
})
htmlPrint(foo, digits = c(0, 0, 0, 2, 2, 2))
## I can't resist, I must make a plot
xyplot(lifeExp ~ year | country, gDat,
subset = country %in% foo$country, type = c("p", "r"))
```
> Discuss: Is Ecuador really interesting? Do you think it's the most interesting country in the Americas? I doubt it. What could we do better? Making plots is ABSOLUTELY ESSENTIAL to sanity checking your claims and beliefs based on numerical computation.
### Student work
* attali-dea [source](https://gist.github.com/daattali/6673361#file-stat545a-2013-hw03_attali-dea-rmd) | [report](http://rpubs.com/daattali/stat545a-2013-hw03_attali-dea)
* baik-jon [source](https://gist.github.com/jonnybaik/6667437#file-stat545a-2013-hw03_baik-jon-rmd) | [report](http://rpubs.com/jonnybaik/stat545a-2013-hw03_baik-jon)
* bolandnazar-moh [source](https://gist.github.com/ArephB/6667983#file-stat545a-2013-hw03_bolandnazar-moh-rmd) | [report](http://rpubs.com/aref/stat545a-2013-hw03_bolandnazar-moh)
* brueckman-chr EDIT HERE
* chu-jus [source](https://gist.github.com/JustinChu/6667252#file-stat545a-2013-hw03_chu-jus-rmd) | [report](http://rpubs.com/cjustin/stat545a-2013-hw03_chu-jus)
* Zachary Daly:
[source](https://gist.github.com/ZDaly/6666365#file-stat545a-2013-hw03_daly-zac-rmd)
[report](http://rpubs.com/Zdaly/stat545a-2013-hw03_daly-zac)
* dinsdale-dan [source](https://gist.github.com/danieldinsdale/6665986#file-stat545a-2013-hw03_dinsdale-dan-rmd) | [report](http://rpubs.com/danieldinsdale/stat545a-2013-hw03_dinsdale-dan)
* gao-wen [source](https://gist.github.com/sibyl229/6668047#file-stat545a-2013-hw03_gao-wen-rmd) | [report](http://rpubs.com/less/stat545a-2013-hw03_gao-wen)
* Matthew Gingerich: [source](https://gist.github.com/MattGingerich/6667676#file-stat545a-2013-hw03_gingerich-mat-rmd) | [report](http://rpubs.com/majugi/stat545a-2013-hw03_gingerich-mat)
* hu-yum [source](https://gist.github.com/smilecat/6666754#file-stat545a-2013-hw03_hu-yum-rmd) | [report](http://rpubs.com/smilecat/stat545a-2013-hw03_hu-yum)
* jewell-sea [source](https://gist.github.com/jewellsean/bf3c28b63e6f99953153#file-stat545a-2013-hw03_jewell-sea-rmd) | [report](http://rpubs.com/jewellsean/stat545a-2013-hw03_jewell-sea)
* johnston-reb [source](https://gist.github.com/rebjoh/6667299#file-stat545a-2013-hw03_johnston-reb-rmd) | [report](http://rpubs.com/rljohn/stat545a-2013-hw03_johnston-reb)
* mahdiar khosravi [source](https://gist.github.com/Mahdiark/6671508#file-stat545a-2013-hw03_khosravi-mah-rmd) | [report](http://rpubs.com/mahdiar/stat545a-2013-hw03_khosravi-mah)
* Wooyong Lee:
[source](https://gist.github.com/folias/6661622#file-stat545a-2013-hw03_lee-woo) | [report](http://rpubs.com/folias/stat545a-2013-hw03_lee-woo)
* liao-wei: [source](https://gist.github.com/feiba/6674299#file-stat545a-2013-hw03_liao_wei-rmd) | [report](http://rpubs.com/winson/stat545a-2013-hw03_liao_wei)
* ma-hui [source](https://gist.github.com/horsehuiting/6673172#file-stat545a-2013-hw03_ma-hui-rmd) | [report](http://rpubs.com/Huiting/stat545a-2013-hw03_ma-hui)
* meng-viv [source](https://gist.github.com/vmeng321/6667418#file-stat545a-2013-hw03_meng-viv-rmd) | [report](http://rpubs.com/vmeng321/stat545a-2013-hw03_meng-viv)
* mohd abul basher-abd [source](https://gist.github.com/atante/6666678#file-stat545a-2013-hw03_mohd-abul-basher-abd-rmd) | [report](http://rpubs.com/meitantei/stat545a-2013-hw03_mohdabulbasher-abd)
* ni-jac [source](https://gist.github.com/jacknii/6663292#file-stat545a-2013-hw03_ni-jac-rmd) | [report](http://rpubs.com/jackni/stat545a-2013-hw03_ni-jac)
* Christian Okkels [source](https://gist.github.com/cbokkels/6654193#file-stat545a-2013-hw03_okkels-chr-rmd) | [report](http://rpubs.com/cbokkels/stat545a-2013-hw03_okkels-chr)
* Greg Owens: [source](https://gist.github.com/opsin/6666283#file-stat545a-2013-hw03_owens-greg-rmd) | [report](http://rpubs.com/opsin/stat545a-2013-hw03_owens-greg)
* Mina Park: [source](https://gist.github.com/parkm87/6665438#file-stat545a-2013-hw03_park-min-rmd) | [report](http://rpubs.com/parkm87/stat545a-2013-hw03_park-min)
* spencer-nei [source](https://gist.github.com/neilspencer/6558151#file-stat545a-2013-hw03_spencer-neil-rmd) | [report](http://rpubs.com/neil_spencer/stat545a-2013-hw03_spencer-nei)
* wang-ton [source](https://gist.github.com/yzhxh/6670081#file-stat545a-2013-hw03_wang-ton-rmd) |
[report](http://rpubs.com/yzhxh/stat545a-2013-hw03_wang-ton)
* Leah Weber: [source](https://gist.github.com/lweber21/6667096#file-stat545a-2013-hw03_weber-lea-rmd) | [report](http://rpubs.com/lweber21/stat545a-2013-hw03_weber-lea)
* woollard-geo [source](https://gist.github.com/geoffwoollard/6666457#file-stat545a-2013-hw03_woollard-geo-rmd) | [report](http://rpubs.com/gwoollard/stat545a-2013-hw03_woollard-geo)
* xue-xinxin [source](https://gist.github.com/xxue/6663673#file-stat545a-2013-hw03_xue-xinxin-rmd) | [report](http://rpubs.com/xxue/8829)
* yuen-ama [source](https://gist.github.com/amandammor/6666007#file-stat545a-2013-hw03_yuen-ama-rmd) | [report](http://rpubs.com/amandammor/stat545a-2013-hw03_yuen-ama)
* zhang-yim [source](https://gist.github.com/zym268/6667601#file-stat545a-2013-hw03_zhang-yim-rmd)| [report](http://rpubs.com/zym268/STAT545A-2013-hw03_zhang-yim)
* zhang-jon [source](https://gist.github.com/jzhang722/6665486#file-stat545a-2013-hw03_zhang-jon-rmd) | [report](http://rpubs.com/jzhang722/stat545a-2013-hw03_zhang-jon)
* zhang-yif
[source](https://gist.github.com/dora7870/6630819#file-stat545a-2013-hw03_zhang-yif-rmd) | [report](http://rpubs.com/dora7870/stat545a-2013-hw03_zhang-yif)
* zhang-jin [source](https://gist.github.com/0527zhangjinyuan/6673340#file-stat545a-2013-hw03_zhang-jin-rmd) | [report](http://rpubs.com/zhangjinyuan/stat545a-2013-hw03_zhang-jin)
* inskip-jes [source](https://gist.github.com/jinskip/6665098#file-stat545a-2013-hw03_inskip-jes-rmd) | [report](http://rpubs.com/jinskip/stat545a-2013-hw03_inskip-jes)
* liu-yan [source](https://gist.github.com/swallow0001/6673645#file-stat545a-2013-hw3_liu) | [report](http://rpubs.com/swallow0001/STAT545a-2013-hw3_Liu)
* haraty-mon EDIT HERE
* yuen-mac [source](https://gist.github.com/myuen/6672095#file-stat545a-2013-hw03_yuen-mac-rmd) | [report](http://rpubs.com/myuen/stat545a-2013-hw03_yuen-mac)
### Q and A
> Student: consider the exercises where we return info on the country that exhibits, e.g., the minimum life expectancy for a certain continent and/or year. What if there are ties and that minimum is achieved by more than 1 country?
I consider this a special case or extension of something we did above: identifying "low life expectancy" countries. So I will mix that code with our min/max finding code. BTW I cannot find an instance where any countries tie for lowest or longest life expectancy, but my two stage solution reflects my attempt to look into that. __I created a copy of `gDat` and altered two life expectancies to create a tie! No, people are not usually living past 110 in Norway and Sweden!__
```{r results = 'asis'}
## I couldn't find any examples of two or more countries sharing the min or max
## life expectancy in a given year; so I tweaked the data!
testDat <- gDat
toReplace <- with(testDat, country %in% c("Norway", "Sweden") & year == 2007)
testDat[toReplace, "lifeExp"] <- 110
## for each life expectancy, compute its rank within all life expectancies
## recorded that year
ggDat <- ddply(testDat, ~ year, transform,
leRank = rank(lifeExp, ties.method = "average"))
## look at the new leRank variable, if you wish
#table(ggDat$leRank)
## pull out the countries with extreme ranks ... with min and max being the most
## extreme of the extremes!
foo <- ddply(ggDat, ~ year, function(x) {
minRank <- min(x$leRank) # you could change this to find the lowest, e.g. 2
maxRank <- max(x$leRank) # and the highest 2
whoMin <- which(x$leRank == minRank) # I allow this to have length > 1
whoMax <- which(x$leRank == maxRank)
data.frame(year = x$year[whoMin[1]],
minLifeExp = x$lifeExp[whoMin[1]],
minCountries = paste(x$country[whoMin], collapse = " "),
minRank = minRank,
maxLifeExp = x$lifeExp[whoMax[1]],
maxCountries = paste(x$country[whoMax], collapse = " "),
maxRank = maxRank)
})
htmlPrint(foo, digits = c(0, 0, 0, 0, 1, 0, 0, 1))
```
Remember, the 2007 data for Norway and Sweden is fake; useful for testing this code.
<div class="footer">
This work is licensed under the <a href="http://creativecommons.org/licenses/by-nc/3.0/">CC BY-NC 3.0 Creative Commons License</a>.
</div>