-
Notifications
You must be signed in to change notification settings - Fork 290
/
21-inferential-analysis.Rmd
435 lines (318 loc) · 24.2 KB
/
21-inferential-analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
```{r inferential-setup, include = FALSE}
knitr::opts_chunk$set(fig.path = "figures/")
library(tidymodels)
library(poissonreg)
library(infer)
tidymodels_prefer()
theme_set(theme_bw())
data("bioChemists", package = "pscl")
```
# Inferential Analysis {#inferential}
:::rmdnote
In Section \@ref(model-types), we outlined a taxonomy of models and said that most models can be categorized as descriptive, inferential, and/or predictive.
:::
Most of the chapters in this book have focused on models from the perspective of the accuracy of predicted values, an important quality of models for all purposes but most relevant for predictive models. Inferential models are usually created not only for their predictions, but also to make inferences or judgments about some component of the model, such as a coefficient value or other parameter. These results are often used to answer some (hopefully) pre-defined questions or hypotheses. In predictive models, predictions on hold-out data are used to validate or characterize the quality of the model. Inferential methods focus on validating the probabilistic or structural assumptions that are made prior to fitting the model.
For example, in ordinary linear regression, the common assumption is that the residual values are independent and follow a Gaussian distribution with a constant variance. While you may have scientific or domain knowledge to lend credence to this assumption for your model analysis, the residuals from the fitted model are usually examined to determine if the assumption was a good idea. As a result, the methods for determining if the model's assumptions have been met are not as simple as looking at holdout predictions, although that can be very useful as well.
We will use p-values in this chapter. However, the tidymodels framework tends to promote confidence intervals over p-values as a method for quantifying the evidence for an alternative hypothesis. As previously shown in Section \@ref(tidyposterior), Bayesian methods are often superior to both p-values and confidence intervals in terms of ease of interpretation (but they can be more computationally expensive).
:::rmdwarning
There has been a push in recent years to move away from p-values in favor of other methods [@pvalue]. See Volume 73 of [*The American Statistician*](https://www.tandfonline.com/toc/utas20/73/) for more information and discussion.
:::
In this chapter, we describe how to use `r pkg(tidymodels)` for fitting and assessing inferential models. In some cases, the tidymodels framework can help users work with the objects produced by their models. In others, it can help assess the quality of a given model.
## Inference for Count Data
To understand how tidymodels packages can be used for inferential modeling, let's focus on an example with count data. We'll use biochemistry publication data from the `r pkg(pscl)` package. These data consist of information on 915 Ph.D. biochemistry graduates and tries to explain factors that impact their academic productivity (measured via number or count of articles published within three years). The predictors include the gender of the graduate, their marital status, the number of children of the graduate that are at least five years old, the prestige of their department, and the number of articles produced by their mentor in the same time period. The data reflect biochemistry doctorates who finished their education between 1956 and 1963. The data are a somewhat biased sample of all of the biochemistry doctorates given during this period (based on completeness of information).
:::rmdnote
Recall that in Chapter \@ref(trust) we asked the question "Is our model applicable for predicting a specific data point?" It is very important to define what populations an inferential analysis applies to. For these data, the results would likely apply to biochemistry doctorates given around the time frame that the data were collected. Does it also apply to other chemistry doctorate types (e.g., medicinal chemistry, etc)? These are important questions to address (and document) when conducting inferential analyses.
:::
A plot of the data shown in Figure \@ref(fig:counts) indicates that many graduates did not publish any articles in this time and that the outcome follows a right-skewed distribution:
```{r inferential-count-dist, eval=FALSE}
library(tidymodels)
tidymodels_prefer()
data("bioChemists", package = "pscl")
ggplot(bioChemists, aes(x = art)) +
geom_histogram(binwidth = 1, color = "white") +
labs(x = "Number of articles within 3y of graduation")
```
```{r counts, ref.label = "inferential-count-dist"}
#| echo = FALSE,
#| out.width = "80%",
#| fig.cap = "Distribution of the number of articles written within 3 years of graduation",
#| fig.alt = "The distribution of the number of articles written within 3 years of graduation. The distribution is right-skewed and most of the data have counts of zero or one."
```
Since the outcome data are counts, the most common distribution assumption to make is that the outcome has a Poisson distribution. This chapter will use these data for several types of analyses.
## Comparisons with Two-Sample Tests
We can start with hypothesis testing. The original author's goal with this data set on biochemistry publication data was to determine if there is a difference in publications between men and women [@Long1992]. The data from the study show:
```{r inferential-counts}
bioChemists %>%
group_by(fem) %>%
summarize(counts = sum(art), n = length(art))
```
There were many more publications by men, although there were also more men in the data. The simplest approach to analyzing these data would be to do a two-sample comparison using the `poisson.test()` function in the `r pkg(stats)` package. It requires the counts for one or two groups.
For our application, the hypotheses to compare the two sexes are:
\begin{align}
H_0&: \lambda_m = \lambda_f \notag \\
H_a&: \lambda_m \ne \lambda_f \notag
\end{align}
where the $\lambda$ values are the rates of publications (over the same time period).
A basic application of the test is:^[The `T` argument allows us to account for the time when the events (publications) were counted, which was three years for both men and women. There are more men than women in these data, but `poisson.test()` has limited functionality so more sophisticated analysis can be used to account for this difference.]
```{r inferential-test-basic}
poisson.test(c(930, 619), T = 3)
```
The function reports a p-value as well as a confidence interval for the ratio of the publication rates. The results indicate that the observed difference is greater than the experiential noise and favors $H_a$.
One issue with using this function is that the results come back as an `htest` object. While this type of object has a well-defined structure, it can be difficult to consume for subsequent operations such as reporting or visualizations. The most impactful tool that tidymodels offers for inferential models is the `tidy()` functions in the `r pkg(broom)` package. As previously seen, this function makes a well-formed, predictably named tibble from the object. We can `tidy()` the results of our two-sample comparison test:
```{r inferential-test-tidy}
poisson.test(c(930, 619)) %>%
tidy()
```
:::rmdnote
Between the [`r pkg(broom)`](https://broom.tidymodels.org/) and [`r pkg(broom.mixed)`](https://CRAN.R-project.org/package=broom.mixed) packages, there are `tidy()` methods for more than 150 models.
:::
While the Poisson distribution is reasonable, we might also want to assess using fewer distributional assumptions. Two methods that might be helpful are the bootstrap and permutation tests [@davison1997bootstrap].
The `r pkg(infer)` package, part of the tidymodels framework, is a powerful and intuitive tool for hypothesis testing [@ModernDive]. Its syntax is concise and designed for nonstatisticians.
First, we `specify()` that we will use the difference in the mean number of articles between the sexes and then `calculate()` the statistic from the data. Recall that the maximum likelihood estimator for the Poisson mean is the sample mean. The hypotheses tested here are the same as the previous test (but are conducted using a different testing procedure).
With `r pkg(infer)`, we specify the outcome and covariate, then state the statistic of interest:
```{r inferential-mean-diff-obs}
library(infer)
observed <-
bioChemists %>%
specify(art ~ fem) %>%
calculate(stat = "diff in means", order = c("Men", "Women"))
observed
```
From here, we compute a confidence interval for this mean by creating the bootstrap distribution via `generate()`; the same statistic is computed for each resampled version of the data:
```{r inferential-mean-diff-boot-gen}
set.seed(2101)
bootstrapped <-
bioChemists %>%
specify(art ~ fem) %>%
generate(reps = 2000, type = "bootstrap") %>%
calculate(stat = "diff in means", order = c("Men", "Women"))
bootstrapped
```
A percentile interval is calculated using:
```{r inferential-mean-diff-boot-ci}
percentile_ci <- get_ci(bootstrapped)
percentile_ci
```
The `r pkg(infer)` package has a high-level API for showing the analysis results, as shown in Figure \@ref(fig:bootstrapped-mean).
```{r inferential-mean-diff-boot, eval = FALSE}
visualize(bootstrapped) +
shade_confidence_interval(endpoints = percentile_ci)
```
```{r bootstrapped-mean, ref.label = "inferential-mean-diff-boot"}
#| echo = FALSE,
#| out.width = "80%",
#| fig.cap = "The bootstrap distribution of the difference in means. The highlighted region is the confidence interval.",
#| fig.alt = "The bootstrap distribution of the difference in means. The highlighted region is the confidence interval, which does not include a value of zero."
```
Since the interval visualized in in Figure \@ref(fig:bootstrapped-mean) does not include zero, these results indicate that men have published more articles than women.
If we require a p-value, the `r pkg(infer)` package can compute the value via a permutation test, shown in the following code. The syntax is very similar to the bootstrapping code we used earlier. We add a `hypothesize()` verb to state the type of assumption to test and the `generate()` call contains an option to shuffle the data.
```{r inferential-mean-diff-perm-gen}
set.seed(2102)
permuted <-
bioChemists %>%
specify(art ~ fem) %>%
hypothesize(null = "independence") %>%
generate(reps = 2000, type = "permute") %>%
calculate(stat = "diff in means", order = c("Men", "Women"))
permuted
```
The following visualization code is also very similar to the bootstrap approach. This code generates Figure \@ref(fig:permutation-dist) where the vertical line signifies the observed value:
```{r inferential-mean-diff-perm, eval = FALSE}
visualize(permuted) +
shade_p_value(obs_stat = observed, direction = "two-sided")
```
```{r permutation-dist, ref.label = "inferential-mean-diff-perm"}
#| echo = FALSE,
#| out.width = "80%",
#| fig.cap = "Empirical distribution of the test statistic under the null hypothesis. The vertical line indicates the observed test statistic.",
#| fig.alt = "The empirical distribution of the test statistic under the null hypothesis. The vertical line indicates the observed test statistic and is far away form the mainstream of the distribution."
```
The actual p-value is:
```{r inferential-mean-diff-perm-pvalue}
permuted %>%
get_p_value(obs_stat = observed, direction = "two-sided")
```
The vertical line representing the null hypothesis in Figure \@ref(fig:permutation-dist) is far away from the permutation distribution. This means, if in fact the null hypothesis were true, the likelihood is exceedingly small of observing data at least as extreme as what is at hand.
The two-sample tests shown in this section are probably suboptimal because they do not account for other factors that might explain the observed relationship between publication rate and sex. Let's move to a more complex model that can consider additional covariates.
## Log-Linear Models
The focus of the rest of this chapter will be on a generalized linear model [@Dobson99] where we assume the counts follow a Poisson distribution. For this model, the covariates/predictors enter the model in a log-linear fashion:
$$
\log(\lambda) = \beta_0 + \beta_1x_1 + \ldots + \beta_px_p
$$
where $\lambda$ is the expected value of the counts.
Let's fit a simple model that contains all of the predictor columns. The `r pkg(poissonreg)` package, a `r pkg(parsnip)` extension package in tidymodels, will fit this model specification:
```{r inferential-glm}
library(poissonreg)
# default engine is 'glm'
log_lin_spec <- poisson_reg()
log_lin_fit <-
log_lin_spec %>%
fit(art ~ ., data = bioChemists)
log_lin_fit
```
The `tidy()` method succinctly summarizes the coefficients for the model (along with 90% confidence intervals):
```{r inferential-glm-tidy}
tidy(log_lin_fit, conf.int = TRUE, conf.level = 0.90)
```
In this output, the p-values correspond to separate hypothesis tests for each parameter:
```{=tex}
\begin{align}
H_0&: \beta_j = 0 \notag \\
H_a&: \beta_j \ne 0 \notag
\end{align}
```
for each of the model parameters. Looking at these results, `phd` (the prestige of their department) may not have any relationship with the outcome.
While the Poisson distribution is the routine assumption for data like these, it may be beneficial to conduct a rough check of the model assumptions by fitting the models without using the Poisson likelihood to calculate the confidence intervals. The `r pkg(rsample)` package has a convenience function to compute bootstrap confidence intervals for `lm()` and `glm()` models. We can use this function, while explicitly declaring `family = poisson`, to compute a large number of model fits. By default, we compute a 90% confidence bootstrap-t interval (percentile intervals are also available):
```{r inferential-glm-ci}
set.seed(2103)
glm_boot <-
reg_intervals(art ~ ., data = bioChemists, model_fn = "glm", family = poisson)
glm_boot
```
:::rmdwarning
When we compare these results (in Figure \@ref(fig:glm-intervals)) to the purely parametric results from `glm()`, the bootstrap intervals are somewhat wider. If the data were truly Poisson, these intervals would have more similar widths.
:::
```{r glm-intervals}
#| echo = FALSE,
#| fig.cap = "Two types of confidence intervals for the Poisson regression model",
#| fig.alt = "Two types of confidence intervals for the Poisson regression model. the interval for the PhD model is the only interval overlapping zero. The parametric intervals tend to be wider than the bootstrap intervals. "
glm_boot %>%
select(term, method = .method, .estimate, .lower, .upper) %>%
bind_rows(
tidy(log_lin_fit, conf.int = TRUE, conf.level = 0.90) %>%
filter(term != "(Intercept)") %>%
mutate(method = "parametric") %>%
select(term, method, .estimate = estimate, .lower = conf.low, .upper = conf.high)
)%>%
ggplot(aes(x = .estimate, y = term, color = method, pch = method)) +
geom_vline(xintercept = 0, lty = 3) +
geom_point(size = 2.5, position = position_dodge(width = 1 / 2)) +
geom_errorbar(aes(xmin = .lower, xmax = .upper),
width = 1 / 4,
position = position_dodge(width = 1 / 2)) +
labs(x = "GLM coefficients", y = NULL, color = NULL, pch = NULL) +
scale_color_brewer(palette = "Paired")
```
Determining which predictors to include in the model is a difficult problem. One approach is to conduct likelihood ratio tests (LRT) [@McCullaghNelder89] between nested models. Based on the confidence intervals, we have evidence that a simpler model without `phd` may be sufficient. Let's fit a smaller model, then conduct a statistical test:
\begin{align}
H_0&: \beta_{phd} = 0 \notag \\
H_a&: \beta_{phd} \ne 0 \notag
\end{align}
This hypothesis was previously tested when we showed the tidied results for `log_lin_fit`. That particular approach used results from a single model fit via a Wald statistic (i.e., the parameter divided by its standard error). For that approach, the p-value was `r tidy(log_lin_fit) %>% filter(term == "phd") %>% pluck("p.value") %>% format.pval()`. We can tidy the results for the LRT to get the p-value:
```{r inferential-reduced, warning = FALSE}
log_lin_reduced <-
log_lin_spec %>%
fit(art ~ ment + kid5 + fem + mar, data = bioChemists)
anova(
extract_fit_engine(log_lin_reduced),
extract_fit_engine(log_lin_fit),
test = "LRT"
) %>%
tidy()
```
The results are the same and, based on these and the confidence interval for this parameter, we'll exclude `phd` from further analyses since it does not appear to be associated with the outcome.
## A More Complex Model
We can move into even more complex models within our tidymodels approach. For count data, there are occasions where the number of zero counts is larger than what a simple Poisson distribution would prescribe. A more complex model appropriate for this situation is the zero-inflated Poisson (ZIP) model; see @Mullahy, @Lambert1992, and @JSSv027i08. Here, there are two sets of covariates: one for the count data and others that affect the probability (denoted as $\pi$) of zeros. The equation for the mean $\lambda$ is:
$$\lambda = 0 \pi + (1 - \pi) \lambda_{nz}$$
where
```{=tex}
\begin{align}
\log(\lambda_{nz}) &= \beta_0 + \beta_1x_1 + \ldots + \beta_px_p \notag \\
\log\left(\frac{\pi}{1-\pi}\right) &= \gamma_0 + \gamma_1z_1 + \ldots + \gamma_qz_q \notag
\end{align}
```
and the $x$ covariates affect the count values while the $z$ covariates influence the probability of a zero. The two sets of predictors do not need to be mutually exclusive.
We'll fit a model with a full set of $z$ covariates:
```{r inference-zip-model}
zero_inflated_spec <- poisson_reg() %>% set_engine("zeroinfl")
zero_inflated_fit <-
zero_inflated_spec %>%
fit(art ~ fem + mar + kid5 + ment | fem + mar + kid5 + phd + ment,
data = bioChemists)
zero_inflated_fit
```
Since the coefficients for this model are also estimated using maximum likelihood, let's try to use another likelihood ratio test to understand if the new model terms are helpful. We will *simultaneously* test that:
```{=tex}
\begin{align}
H_0&: \gamma_1 = 0, \gamma_2 = 0, \cdots, \gamma_5 = 0 \notag \\
H_a&: \text{at least one } \gamma \ne 0 \notag
\end{align}
```
Let's try ANOVA again:
```{r inference-zip-anova, error = TRUE}
anova(
extract_fit_engine(zero_inflated_fit),
extract_fit_engine(log_lin_reduced),
test = "LRT"
) %>%
tidy()
```
An `anova()` method isn't implemented for `zeroinfl` objects!
An alternative is to use an *information criterion statistic*, such as the Akaike information criterion (AIC) [@claeskens2016statistical]. This computes the log-likelihood (from the training set) and penalizes that value based on the training set size and the number of model parameters. In R's parameterization, smaller AIC values are better. In this case, we are not conducting a formal statistical test but *estimating* the ability of the data to fit the data.
The results indicate that the ZIP model is preferable:
```{r inference-zip-aic}
zero_inflated_fit %>% extract_fit_engine() %>% AIC()
log_lin_reduced %>% extract_fit_engine() %>% AIC()
```
However, it's hard to contextualize this pair of single values and assess *how* different they actually are. To solve this problem, we'll resample a large number of each of these two models. From these, we can compute the AIC values for each and determine how often the results favor the ZIP model. Basically, we will be characterizing the uncertainty of the AIC statistics to gauge their difference relative to the noise in the data.
We'll also compute more bootstrap confidence intervals for the parameters in a bit so we specify the `apparent = TRUE` option when creating the bootstrap samples. This is required for some types of intervals.
First, we create the 4,000 model fits:
```{r inference-zip-comp}
zip_form <- art ~ fem + mar + kid5 + ment | fem + mar + kid5 + phd + ment
glm_form <- art ~ fem + mar + kid5 + ment
set.seed(2104)
bootstrap_models <-
bootstraps(bioChemists, times = 2000, apparent = TRUE) %>%
mutate(
glm = map(splits, ~ fit(log_lin_spec, glm_form, data = analysis(.x))),
zip = map(splits, ~ fit(zero_inflated_spec, zip_form, data = analysis(.x)))
)
bootstrap_models
```
Now we can extract the model fits and their corresponding AIC values:
```{r inference-zip-aic-resampled}
bootstrap_models <-
bootstrap_models %>%
mutate(
glm_aic = map_dbl(glm, ~ extract_fit_engine(.x) %>% AIC()),
zip_aic = map_dbl(zip, ~ extract_fit_engine(.x) %>% AIC())
)
mean(bootstrap_models$zip_aic < bootstrap_models$glm_aic)
```
It seems definitive from these results that accounting for the excessive number of zero counts is a good idea.
:::rmdnote
We could have used `fit_resamples()` or a workflow set to conduct these computations. In this section, we used `mutate()` and `map()` to compute the models to demonstrate how one might use tidymodels tools for models that are not supported by one of the `r pkg(parsnip)` packages.
:::
Since we have computed the resampled model fits, let's create bootstrap intervals for the zero probability model coefficients (i.e., the $\gamma_j$). We can extract these with the `tidy()` method and use the `type = "zero"` option to obtain these estimates:
```{r inference-zip-coefs}
bootstrap_models <-
bootstrap_models %>%
mutate(zero_coefs = map(zip, ~ tidy(.x, type = "zero")))
# One example:
bootstrap_models$zero_coefs[[1]]
```
It's a good idea to visualize the bootstrap distributions of the coefficients, as in Figure \@ref(fig:zip-bootstrap).
```{r inference-zip-bootstrap, eval=FALSE}
bootstrap_models %>%
unnest(zero_coefs) %>%
ggplot(aes(x = estimate)) +
geom_histogram(bins = 25, color = "white") +
facet_wrap(~ term, scales = "free_x") +
geom_vline(xintercept = 0, lty = 2, color = "gray70")
```
```{r zip-bootstrap, ref.label = "inference-zip-bootstrap"}
#| echo = FALSE,
#| fig.cap = "Bootstrap distributions of the ZIP model coefficients. The vertical lines indicate the observed estimates. ",
#| fig.alt = "Bootstrap distributions of the ZIP model coefficients. The vertical lines indicate the observed estimates. The ment predictor that appears to be important to the model."
```
One of the covariates (`ment`) that appears to be important has a very skewed distribution. The extra space in some of the facets indicates there are some outliers in the estimates. This *might* occur when models did not converge; those results probably should be excluded from the resamples. For the results visualized in Figure \@ref(fig:zip-bootstrap), the outliers are due only to extreme parameter estimates; all of the models converged.
The `r pkg(rsample)` package contains a set of functions named `int_*()` that compute different types of bootstrap intervals. Since the `tidy()` method contains standard error estimates, the bootstrap-t intervals can be computed. We'll also compute the standard percentile intervals. By default, 90% confidence intervals are computed.
```{r inference-zip-intervals}
bootstrap_models %>% int_pctl(zero_coefs)
bootstrap_models %>% int_t(zero_coefs)
```
From these results, we can get a good idea of which predictor(s) to include in the zero count probability model. It may be sensible to refit a smaller model to assess if the bootstrap distribution for `ment` is still skewed.
## More Inferential Analysis {#inference-options}
This chapter demonstrated just a small subset of what is available for inferential analysis in tidymodels and has focused on resampling and frequentist methods. Arguably, Bayesian analysis is a very effective and often superior approach for inference. A variety of Bayesian models are available via `r pkg(parsnip)`. Additionally, the `r pkg(multilevelmod)` package enables users to fit hierarchical Bayesian and non-Bayesian models (e.g., mixed models). The `r pkg(broom.mixed)` and `r pkg(tidybayes)` packages are excellent tools for extracting data for plots and summaries. Finally, for data sets with a single hierarchy, such as simple longitudinal or repeated measures data, `r pkg(rsample)`'s `group_vfold_cv()` function facilitates straightforward out-of-sample characterizations of model performance.
## Chapter Summary {#inference-summary}
The tidymodels framework is for more than predictive modeling alone. Packages and functions from tidymodels can be used for hypothesis testing, as well as fitting and assessing inferential models. The tidymodels framework provides support for working with non-tidymodels R models, and can help assess the statistical qualities of your models.