Forecasting 3-4 - ARMA Fitting Models.Rmd

---
output:
  xaringan::moon_reader:
    css: "my-theme.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
---

layout: true

.hheader[<a href="index.html">`r fontawesome::fa("home", fill = "steelblue")`</a>]

---

```{r setup, include=FALSE, message=FALSE}
options(htmltools.dir.version = FALSE, servr.daemon = TRUE)
knitr::opts_chunk$set(fig.height=5, fig.align="center")
library(huxtable)
```

class: center, middle, inverse
# ARIMA Models
## Fitting and Checking Models

.futnote[Eli Holmes, UW SAFS]

.citation[eeholmes@uw.edu]

---

```{r load_data, echo=FALSE, message=FALSE, warning=FALSE}
library(ggplot2)
library(gridExtra)
library(reshape2)
library(tseries)
library(forecast)
```

## Fitting ARIMA Models

We are now at step B of the Box-Jenkins Method.  

A. Model form selection
    1. Evaluate stationarity and seasonality
    2. Selection of the differencing level (d)
    3. Selection of the AR level (p)
    4. Selection of the MA level (q)
    
B. **Parameter estimation**

C. **Model checking**

---

## Fitting with `auto.arima()`

`auto.arima()` (in the forecast package) has many arguments.

```
auto.arima(y, d = NA, D = NA, max.p = 5, max.q = 5, max.P = 2,
  max.Q = 2, max.order = 5, max.d = 2, max.D = 1, start.p = 2,
  start.q = 2, start.P = 1, start.Q = 1, stationary = FALSE,
  seasonal = TRUE, ic = c("aicc", "aic", "bic"), stepwise = TRUE,
  trace = FALSE, approximation = (length(x) > 150 | frequency(x) > 12),
  truncate = NULL, xreg = NULL, test = c("kpss", "adf", "pp"),
  seasonal.test = c("seas", "ocsb", "hegy", "ch"), allowdrift = TRUE,
  allowmean = TRUE, lambda = NULL, biasadj = FALSE, parallel = FALSE,
  num.cores = 2, x = y, ...)
```

When just getting started, we will focus just on a few of these.

* `trace` To print out the models that were tested.
* `stepwise` and `approximation` To use slower but better estimation when selecting model order.
* `test` The test to use to select the amount of differencing.

---

## Load the data

```{r load_data2, message=FALSE, warning=FALSE}
load("landings.RData")
landings$log.metric.tons = log(landings$metric.tons)
landings = subset(landings, Year <= 1987)
anchovy = subset(landings, Species=="Anchovy")$log.metric.tons
sardine = subset(landings, Species=="Sardine")$log.metric.tons
```


---

## Fit to the anchovy data

```{r fit.anchovy}
library(forecast)
fit <- auto.arima(anchovy)
```

Here are the values for anchovy in Table 8 of Stergiou and Christou.

Model  | $\theta_1$ | drift (c) | R$^2$ | BIC | LB
------------- | --------| --- | --- | --- | --- | 
(0,1,1) | 0.563 | 0.064 | 0.83 | 1775 | 5.4

Here is the equivalent values from `auto.arima()`:

```{r echo=FALSE}
res <- resid(fit)
LB <- Box.test(res, type="Ljung-Box", lag=12, fitdf=2)$statistic
meany <- mean(anchovy, na.rm=TRUE)
r2 <- 1- sum(resid(fit)^2,na.rm=TRUE)/sum((anchovy-meany)^2,na.rm=TRUE)

df <- data.frame(Model="(0,1,1)", theta1=-1*coef(fit)[1], drift=coef(fit)[2], R2 = r2, BIC=fit$bic, LB=LB)
```

```{r echo=FALSE, results = 'asis'}
knitr::kable(df, row.names=FALSE, format='html')
```


---

Where do we find the components of Stergiou and Christou's Table 8?

### The estimates of $\theta_1$ and $c$

```{r}
coef(fit)
```
The `ma1` is the same as $\theta_1$ except its negative because of the way Stergiou and Christou write their MA models.  They write it as 

$$e_t = \eta_t - \theta_1 \eta_{t-1}$$

instead of the form that `auto.arima()` uses

$$e_t = \eta_t + \theta_1 \eta_{t-1}$$

---

### R2

```{r}
res <- resid(fit)
meany <- mean(anchovy, na.rm=TRUE)
r2 <- 1- sum(res^2,na.rm=TRUE)/sum((anchovy-meany)^2,na.rm=TRUE)
```

### Ljung-Box statistic

```{r}
LB <- Box.test(res, type="Ljung-Box", lag=12, fitdf=2)$statistic
```
fitdf=2 is from the number of parameters estimated.

### BIC

BIC is in `fit$BIC`.  Why is BIC different? Because there is a missing constant, which is fairly common.  The absolute value of BIC is unimportant.  Only its value relative to other models that you tested is important.

---

## Look at the model that were tested

By default `auto.arima()` uses AICc for model selection and the AICc values are shown.  Smaller is better for AICc and AICc values that are different by less than 2 have similar data support.  Look for any models with similar AICc to the best selected model.  You should consider that model also.

```{r fit.anchovy.trace}
auto.arima(anchovy, trace=TRUE)
```

---

## Repeat with the sardine data

Stergiou and Christou sardine model (Table 8) is ARIMA(0,1,0):
$$x_t = x_{t-1}+e_t$$


The model selected by `auto.arima()` is ARIMA(0,0,1):
$$x_t = e_t + \theta_1 e_{t-1}$$

```{r}
auto.arima(sardine)
```


---

Why?  Stergiou and Christou used the Augmented Dickey-Fuller test to determine the amount of differencing needed while the default for `auto.arima()` is to use the KPSS test.

## Repeat using `test='adf'`

Now the selected model is the same.

```{r}
fit <- auto.arima(sardine, test="adf")
fit
```

---

Compare the estimated values in Stergiou and Christou Table 8:

Model  | $\theta_1$ | drift (c) | R2 | BIC | LB
------------- | --------| --- | --- | --- | --- | 
(0,1,0) | NA | NA | 0.00 | 1396 | 22.2

versus from `auto.arima()`

```{r echo=FALSE}
res <- resid(fit)
LB <- Box.test(res, type="Ljung-Box", lag=12, fitdf=0)$statistic
meany <- mean(sardine, na.rm=TRUE)
r2 <- 1- sum(resid(fit)^2,na.rm=TRUE)/sum((sardine-meany)^2,na.rm=TRUE)

df <- data.frame(Model="(0,1,0)", theta1=-1*coef(fit)[1], drift=coef(fit)[2], R2 = r2, BIC=fit$bic, LB=LB)
```

```{r echo=FALSE, results = 'asis'}
knitr::kable(df, row.names=FALSE, format='html')
```

---

## Missing values

These functions work fine with missing values.  Missing values are denoted NA.

```{r}
anchovy.miss <- anchovy
anchovy.miss[10:14] <- NA
fit <- auto.arima(anchovy.miss)
fit
```

---

## Fit a specific ARIMA model

Sometimes you don't want to search, but rather fit an ARIMA model with a specific order.  Say you wanted to fit this model:
$$x_t = \beta_1 x_{t-1} + \beta_2 x_{t-2} + e_t$$
For that you can use `Arima()` in the forecast package:

```{r}
fit.AR2 <- Arima(anchovy, order=c(2,0,0))
fit.AR2
```

---

## Model Checking

* Plot your data

* Is the plot long-tailed (Chl, some types of fish data)? Take the logarithm.

* Fit model.

* Plot your residuals

* Check your residuals for stationarity, normality, and independence

---

Ideally your response variable will be unimodal.  If not, you are using an ARIMA model that doesn't produce data like yours.  While you could change the assumptions about the error distribution in the model, it will be easier to transform your data.

## Plot of anchovy data

```{r fig.width=10, echo=FALSE, fig.align="center"}
par(mfrow=c(1,2))
hist(subset(landings, Species=="Anchovy")$metric.tons, main="metric.tons", xlab="anchovy")
hist(anchovy, main="log.metric.tons",xlab="log anchovy")
```

---


```{r fig.align="center"}
fit <- auto.arima(anchovy)
checkresiduals(fit)
```

---

## Modeling Workflow for non-seasonal data

* Go through Box-Jenkins Method to evaluate stationarity

* Plot the data and make decisions about transformations to make the data more unimodal

* Make some decisions about differencing and any other data transformations via the stationarity tests

* Use `auto.arima(data, trace=TRUE)` to evaluate what ARMA models best fit the data.  Fix the differencing if needed.

* Determine a set of candidate models.  Include a null model in the candidate list.  naive and naive with drift are typical nulls.

* Test candidate models for forecast performance with cross-validation (next lecture).

---

## Summary

- `auto.arima()` in the forecast package is a good choice for selection and fitting of ARIMA models.

- `Arima()` is a good choice when you know the order (structure) of the model.

- You (may) need to know whether the mean of the data should be zero and whether it is stationary around a linear line.
    - `include.mean=TRUE` means the mean is not zero
    - `include.drift=TRUE` means fit a model that fluctuates around a trend (up or down)

### Seasonality

These functions will also fit seasonal models.  We will address seasonality later.

---

## Final note: stepwise model selection

Stepwise model selection is fast and useful if you need to explore many models and it takes awhile to fit.  Our models fit quickly and we don't have season in our models.  Though it will not make a difference for this particular dataset, in general set `stepwise=FALSE` to do a more thorough model search.

```{r fit.anchovy2}
auto.arima(anchovy, stepwise=FALSE, approximation=FALSE)
```


## Inputting data: one response variable

If your data look like this:

```
  Year Species metric.tons
 2018,   Fish1,           1
 2019,   Fish1,           2
 2018,   Fish2,           3
 2019,   Fish2,           4
 2018,   Fish3,           6
 2019,   Fish4,          NA
```
with this code:

```
test <- read.csv("Data/test.csv", stringsAsFactors = FALSE)
save(test, file="test.RData")
```

---

## Inputting data: many response variables

Read in a file where the data are in columns.  If your data look like this with each species (or site) across the columns:

```
Year,Anchovy,Sardine,Chub mackerel,Horse mackerel,Mackerel,Jack Mackerel
1964,5449.2,12984.4,1720.7,4022.4,NA,NA
1965,4263.5,10611.1,1278.5,4158.3,NA,NA
1966,5146.4,11437.8,802.6,3012.1,NA,NA
```

Use this code:

```
library(reshape2)
test <- read.csv("Data/test.csv", stringsAsFactors = FALSE)
melt(test, id="Year", value.name="metric.tons", variable.name="Species")
save(test, file="test.RData")
```

---

## Inputting data: many response variables, two time variables

If your data also have, say, a month (or qtr) column, use this code:

```
Year,Month,Anchovy,Sardine,Chub mackerel,Horse mackerel,Mackerel,Jack Mackerel
1964,1,5449.2,12984.4,1720.7,4022.4,NA,NA
1964,2,4263.5,10611.1,1278.5,4158.3,NA,NA
1964,3,5146.4,11437.8,802.6,3012.1,NA,NA
```
Use this code:

```
library(reshape2)
test <- read.csv("Data/test.csv", stringsAsFactors = FALSE)
melt(test, id=c("Year","Month"), value.name="metric.tons", variable.name="Species")
save(test, file="test.RData")
```

---

## Inputting data: one response variable, multiple explanatory variables

```
Year, Anchovy, SST,  Mackerel
1964, 5449.2,  24.4, 1720.7
1965, 4263.5,  30.1, 1278.5
1966, 5146.4,  23.8,  802.6
```

Use this code:

```
test <- read.csv("Data/test.csv", stringsAsFactors = FALSE)
save(test, file="test.RData")
```

Use this `lm()` model (or gam() etc):

```
fit <- lm(Anchovy ~ SST + Mackerel, data=test)
```

---