-
Notifications
You must be signed in to change notification settings - Fork 5
/
Forecasting 3-4 - ARMA Fitting Models.Rmd
415 lines (277 loc) · 10.3 KB
/
Forecasting 3-4 - ARMA Fitting Models.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
---
output:
xaringan::moon_reader:
css: "my-theme.css"
lib_dir: libs
nature:
highlightStyle: github
highlightLines: true
---
layout: true
.hheader[<a href="index.html">`r fontawesome::fa("home", fill = "steelblue")`</a>]
---
```{r setup, include=FALSE, message=FALSE}
options(htmltools.dir.version = FALSE, servr.daemon = TRUE)
knitr::opts_chunk$set(fig.height=5, fig.align="center")
library(huxtable)
```
class: center, middle, inverse
# ARIMA Models
## Fitting and Checking Models
.futnote[Eli Holmes, UW SAFS]
.citation[[email protected]]
---
```{r load_data, echo=FALSE, message=FALSE, warning=FALSE}
library(ggplot2)
library(gridExtra)
library(reshape2)
library(tseries)
library(forecast)
```
## Fitting ARIMA Models
We are now at step B of the Box-Jenkins Method.
A. Model form selection
1. Evaluate stationarity and seasonality
2. Selection of the differencing level (d)
3. Selection of the AR level (p)
4. Selection of the MA level (q)
B. **Parameter estimation**
C. **Model checking**
---
## Fitting with `auto.arima()`
`auto.arima()` (in the forecast package) has many arguments.
```
auto.arima(y, d = NA, D = NA, max.p = 5, max.q = 5, max.P = 2,
max.Q = 2, max.order = 5, max.d = 2, max.D = 1, start.p = 2,
start.q = 2, start.P = 1, start.Q = 1, stationary = FALSE,
seasonal = TRUE, ic = c("aicc", "aic", "bic"), stepwise = TRUE,
trace = FALSE, approximation = (length(x) > 150 | frequency(x) > 12),
truncate = NULL, xreg = NULL, test = c("kpss", "adf", "pp"),
seasonal.test = c("seas", "ocsb", "hegy", "ch"), allowdrift = TRUE,
allowmean = TRUE, lambda = NULL, biasadj = FALSE, parallel = FALSE,
num.cores = 2, x = y, ...)
```
When just getting started, we will focus just on a few of these.
* `trace` To print out the models that were tested.
* `stepwise` and `approximation` To use slower but better estimation when selecting model order.
* `test` The test to use to select the amount of differencing.
---
## Load the data
```{r load_data2, message=FALSE, warning=FALSE}
load("landings.RData")
landings$log.metric.tons = log(landings$metric.tons)
landings = subset(landings, Year <= 1987)
anchovy = subset(landings, Species=="Anchovy")$log.metric.tons
sardine = subset(landings, Species=="Sardine")$log.metric.tons
```
---
## Fit to the anchovy data
```{r fit.anchovy}
library(forecast)
fit <- auto.arima(anchovy)
```
Here are the values for anchovy in Table 8 of Stergiou and Christou.
Model | $\theta_1$ | drift (c) | R$^2$ | BIC | LB
------------- | --------| --- | --- | --- | --- |
(0,1,1) | 0.563 | 0.064 | 0.83 | 1775 | 5.4
Here is the equivalent values from `auto.arima()`:
```{r echo=FALSE}
res <- resid(fit)
LB <- Box.test(res, type="Ljung-Box", lag=12, fitdf=2)$statistic
meany <- mean(anchovy, na.rm=TRUE)
r2 <- 1- sum(resid(fit)^2,na.rm=TRUE)/sum((anchovy-meany)^2,na.rm=TRUE)
df <- data.frame(Model="(0,1,1)", theta1=-1*coef(fit)[1], drift=coef(fit)[2], R2 = r2, BIC=fit$bic, LB=LB)
```
```{r echo=FALSE, results = 'asis'}
knitr::kable(df, row.names=FALSE, format='html')
```
---
Where do we find the components of Stergiou and Christou's Table 8?
### The estimates of $\theta_1$ and $c$
```{r}
coef(fit)
```
The `ma1` is the same as $\theta_1$ except its negative because of the way Stergiou and Christou write their MA models. They write it as
$$e_t = \eta_t - \theta_1 \eta_{t-1}$$
instead of the form that `auto.arima()` uses
$$e_t = \eta_t + \theta_1 \eta_{t-1}$$
---
### R2
```{r}
res <- resid(fit)
meany <- mean(anchovy, na.rm=TRUE)
r2 <- 1- sum(res^2,na.rm=TRUE)/sum((anchovy-meany)^2,na.rm=TRUE)
```
### Ljung-Box statistic
```{r}
LB <- Box.test(res, type="Ljung-Box", lag=12, fitdf=2)$statistic
```
fitdf=2 is from the number of parameters estimated.
### BIC
BIC is in `fit$BIC`. Why is BIC different? Because there is a missing constant, which is fairly common. The absolute value of BIC is unimportant. Only its value relative to other models that you tested is important.
---
## Look at the model that were tested
By default `auto.arima()` uses AICc for model selection and the AICc values are shown. Smaller is better for AICc and AICc values that are different by less than 2 have similar data support. Look for any models with similar AICc to the best selected model. You should consider that model also.
```{r fit.anchovy.trace}
auto.arima(anchovy, trace=TRUE)
```
---
## Repeat with the sardine data
Stergiou and Christou sardine model (Table 8) is ARIMA(0,1,0):
$$x_t = x_{t-1}+e_t$$
The model selected by `auto.arima()` is ARIMA(0,0,1):
$$x_t = e_t + \theta_1 e_{t-1}$$
```{r}
auto.arima(sardine)
```
---
Why? Stergiou and Christou used the Augmented Dickey-Fuller test to determine the amount of differencing needed while the default for `auto.arima()` is to use the KPSS test.
## Repeat using `test='adf'`
Now the selected model is the same.
```{r}
fit <- auto.arima(sardine, test="adf")
fit
```
---
Compare the estimated values in Stergiou and Christou Table 8:
Model | $\theta_1$ | drift (c) | R2 | BIC | LB
------------- | --------| --- | --- | --- | --- |
(0,1,0) | NA | NA | 0.00 | 1396 | 22.2
versus from `auto.arima()`
```{r echo=FALSE}
res <- resid(fit)
LB <- Box.test(res, type="Ljung-Box", lag=12, fitdf=0)$statistic
meany <- mean(sardine, na.rm=TRUE)
r2 <- 1- sum(resid(fit)^2,na.rm=TRUE)/sum((sardine-meany)^2,na.rm=TRUE)
df <- data.frame(Model="(0,1,0)", theta1=-1*coef(fit)[1], drift=coef(fit)[2], R2 = r2, BIC=fit$bic, LB=LB)
```
```{r echo=FALSE, results = 'asis'}
knitr::kable(df, row.names=FALSE, format='html')
```
---
## Missing values
These functions work fine with missing values. Missing values are denoted NA.
```{r}
anchovy.miss <- anchovy
anchovy.miss[10:14] <- NA
fit <- auto.arima(anchovy.miss)
fit
```
---
## Fit a specific ARIMA model
Sometimes you don't want to search, but rather fit an ARIMA model with a specific order. Say you wanted to fit this model:
$$x_t = \beta_1 x_{t-1} + \beta_2 x_{t-2} + e_t$$
For that you can use `Arima()` in the forecast package:
```{r}
fit.AR2 <- Arima(anchovy, order=c(2,0,0))
fit.AR2
```
---
## Model Checking
* Plot your data
* Is the plot long-tailed (Chl, some types of fish data)? Take the logarithm.
* Fit model.
* Plot your residuals
* Check your residuals for stationarity, normality, and independence
---
Ideally your response variable will be unimodal. If not, you are using an ARIMA model that doesn't produce data like yours. While you could change the assumptions about the error distribution in the model, it will be easier to transform your data.
## Plot of anchovy data
```{r fig.width=10, echo=FALSE, fig.align="center"}
par(mfrow=c(1,2))
hist(subset(landings, Species=="Anchovy")$metric.tons, main="metric.tons", xlab="anchovy")
hist(anchovy, main="log.metric.tons",xlab="log anchovy")
```
---
```{r fig.align="center"}
fit <- auto.arima(anchovy)
checkresiduals(fit)
```
---
## Modeling Workflow for non-seasonal data
* Go through Box-Jenkins Method to evaluate stationarity
* Plot the data and make decisions about transformations to make the data more unimodal
* Make some decisions about differencing and any other data transformations via the stationarity tests
* Use `auto.arima(data, trace=TRUE)` to evaluate what ARMA models best fit the data. Fix the differencing if needed.
* Determine a set of candidate models. Include a null model in the candidate list. naive and naive with drift are typical nulls.
* Test candidate models for forecast performance with cross-validation (next lecture).
---
## Summary
- `auto.arima()` in the forecast package is a good choice for selection and fitting of ARIMA models.
- `Arima()` is a good choice when you know the order (structure) of the model.
- You (may) need to know whether the mean of the data should be zero and whether it is stationary around a linear line.
- `include.mean=TRUE` means the mean is not zero
- `include.drift=TRUE` means fit a model that fluctuates around a trend (up or down)
### Seasonality
These functions will also fit seasonal models. We will address seasonality later.
---
## Final note: stepwise model selection
Stepwise model selection is fast and useful if you need to explore many models and it takes awhile to fit. Our models fit quickly and we don't have season in our models. Though it will not make a difference for this particular dataset, in general set `stepwise=FALSE` to do a more thorough model search.
```{r fit.anchovy2}
auto.arima(anchovy, stepwise=FALSE, approximation=FALSE)
```
## Inputting data: one response variable
If your data look like this:
```
Year Species metric.tons
2018, Fish1, 1
2019, Fish1, 2
2018, Fish2, 3
2019, Fish2, 4
2018, Fish3, 6
2019, Fish4, NA
```
with this code:
```
test <- read.csv("Data/test.csv", stringsAsFactors = FALSE)
save(test, file="test.RData")
```
---
## Inputting data: many response variables
Read in a file where the data are in columns. If your data look like this with each species (or site) across the columns:
```
Year,Anchovy,Sardine,Chub mackerel,Horse mackerel,Mackerel,Jack Mackerel
1964,5449.2,12984.4,1720.7,4022.4,NA,NA
1965,4263.5,10611.1,1278.5,4158.3,NA,NA
1966,5146.4,11437.8,802.6,3012.1,NA,NA
```
Use this code:
```
library(reshape2)
test <- read.csv("Data/test.csv", stringsAsFactors = FALSE)
melt(test, id="Year", value.name="metric.tons", variable.name="Species")
save(test, file="test.RData")
```
---
## Inputting data: many response variables, two time variables
If your data also have, say, a month (or qtr) column, use this code:
```
Year,Month,Anchovy,Sardine,Chub mackerel,Horse mackerel,Mackerel,Jack Mackerel
1964,1,5449.2,12984.4,1720.7,4022.4,NA,NA
1964,2,4263.5,10611.1,1278.5,4158.3,NA,NA
1964,3,5146.4,11437.8,802.6,3012.1,NA,NA
```
Use this code:
```
library(reshape2)
test <- read.csv("Data/test.csv", stringsAsFactors = FALSE)
melt(test, id=c("Year","Month"), value.name="metric.tons", variable.name="Species")
save(test, file="test.RData")
```
---
## Inputting data: one response variable, multiple explanatory variables
```
Year, Anchovy, SST, Mackerel
1964, 5449.2, 24.4, 1720.7
1965, 4263.5, 30.1, 1278.5
1966, 5146.4, 23.8, 802.6
```
Use this code:
```
test <- read.csv("Data/test.csv", stringsAsFactors = FALSE)
save(test, file="test.RData")
```
Use this `lm()` model (or gam() etc):
```
fit <- lm(Anchovy ~ SST + Mackerel, data=test)
```
---