---
title: 'Generalized linear models – theory and data examples'
layout: post
date: 2024-12-07
tags: [probability theory,statistics,case study]
pin: true
math: true
---

This article is a sort of introduction to generalized linear models,
explaining the basic theory, how exactly they relate to linear regression,
and illustrating why they may be useful on some data examples.

One important case where linear regression fails when modelling time series is when the data are positive integers close to zero
(e.g. daily counts of purchases of a low-selling product).
This is because there are certain assumptions encoded in the model
that we often justify with some handwavy reference
to the central limit theorem.
But in this case that may easily lead to poor results:
for small counts, the distinction between a normal and a discrete counting distribution is significant.
(See the article about [counting events](/posts/counting-events) for a more detailed discussion.)

Generalized linear models (GLMs) may fix this and similar issues by allowing other kinds of data distributions
while keeping a lot of the desirable and familiar properties of linear regression.

## Linear regression
<span title="who I expect to read this blog">Everybody</span>
knows good-old linear regression.
But it's a good place to start with the topic, so let's recap a bit.
First, let's study the relation between linear regression and ordinary least squares (OLS).

### Gauss–Markov theorem
OLS is the most basic and most common method to make a linear fit
and is usually justified by referring to the [Gauss–Markov theorem](https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem),
which says that ordinary least squares is the best linear unbiased estimator under certain conditions.

In particular:
we call an estimator unbiased when $\mathrm{E}[\hat{\beta}] = \beta$.
We say it's best in the sense of [statistical efficiency](https://en.wikipedia.org/wiki/Efficiency_(statistics)),
i.e. we want to minimize $\mathrm{Var}[\hat{\beta}]$.

We have

$$\begin{equation}\label{eq:linreg}y = X\beta + \varepsilon,\end{equation}$$

where $y,\varepsilon \in \mathbb{R}^n, \beta \in \mathbb{R}^k, X \in \mathbb{R}^{n\times k}$
and the individual $\varepsilon_i$ are understood as random variables.
Since $y$ depends on $\varepsilon$, it can also be analyzed as a random variable, and so can any estimate $\hat{\beta}$ which depends on $y$.
Assume the error terms $\varepsilon$ have zero mean, i.e. $\mathrm{E}[\varepsilon_i] = 0$,
and all have the same variance and are uncorrelated, i.e. $\mathrm{Cov}[\varepsilon_i, \varepsilon_j] = \sigma^2 \delta_{ij}$ where $\sigma \in \mathbb{R}^+$
(here using the Kronecker delta symbol).
The Gauss–Markov theorem says that under these conditions,
the OLS estimator, given by

$$\hat{\beta} = \arg\min_{\beta} \lVert X\beta - y \rVert_2^2 = \left(X^T X\right)^{-1} X^T y$$

is the best estimator among those which are unbiased.
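
As a side note, here is a minimal numpy sketch of the closed-form estimator above on synthetic data; everything in it, including the true $\beta$, is made up just for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: n observations, k = 2 coefficients (intercept + slope).
n = 200
true_beta = np.array([1.5, -0.7])
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
eps = rng.normal(0, 1.0, n)          # zero-mean, equal-variance, uncorrelated errors
y = X @ true_beta + eps

# Closed-form OLS estimate: (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# lstsq solves the same least-squares problem in a numerically safer way
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)        # should be close to true_beta
print(beta_hat_lstsq)  # same estimate, computed more stably
```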

The theorem is quite powerful, because it doesn't specify which distribution $\varepsilon$ has, only that it needs to have certain properties.

But what if we decided we want to maximize efficiency, perhaps even at the cost of having a biased estimator?

### MLE for linear regression

Now, let's specify what distribution the error terms $\varepsilon$ have.
The most common choice is:

$$\begin{equation}\label{eq:normalerr} \varepsilon_i \sim \mathrm{Normal}(0, \sigma), \end{equation}$$

where $\sigma \in \mathbb{R}^+$ is constant.
The assumptions of the Gauss–Markov theorem are satisfied.
Coincidentally, it <span title="Cool exercise for the univariate case">can be shown</span>
that the maximum likelihood estimate of $\beta$ is in this case again the OLS estimate.
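
(A quick sketch of why, in case you want to skip the exercise: under $\eqref{eq:normalerr}$ the log-likelihood is

$$\log L(\beta) = -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\lVert X\beta - y \rVert_2^2,$$

so for fixed $\sigma$, maximizing it over $\beta$ is exactly minimizing $\lVert X\beta - y \rVert_2^2$, i.e. OLS.)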

But what if the errors have a different distribution? For example, [Laplace](https://en.wikipedia.org/wiki/Laplace_distribution):

$$ \varepsilon_i \sim \mathrm{Laplace}\left(0, \theta'\right)$$

Now the assumptions of the Gauss–Markov theorem are still satisfied, but the MLE is least absolute deviations (LAD);
in other words, we use the $l^1$ norm instead of $l^2$:

$$\hat{\beta} = \arg\min_{\beta} \lVert X\beta - y \rVert_1 = \arg\min_{\beta} \sum_i | X_i \beta - y_i|$$
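
Here is a minimal sketch of fitting both estimators on made-up data with Laplace noise, assuming statsmodels is available; median (i.e. quantile 0.5) regression is one standard way to compute the LAD fit:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Synthetic data: straight line plus Laplace-distributed noise.
n = 300
x = np.linspace(0, 10, n)
X = sm.add_constant(x)                      # intercept + slope
y = 2.0 + 0.5 * x + rng.laplace(0, 1.0, n)  # true line: 2 + 0.5 x

ols_fit = sm.OLS(y, X).fit()                # the unbiased Gauss–Markov estimator
lad_fit = sm.QuantReg(y, X).fit(q=0.5)      # median regression == least absolute deviations

print(ols_fit.params)  # OLS estimate of (intercept, slope)
print(lad_fit.params)  # LAD / MLE-under-Laplace estimate
```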

Under usual circumstances, [MLE estimates are statistically efficient](https://gregorygundersen.com/blog/2019/11/28/asymptotic-normality-mle/), at least asymptotically.
The following plot shows some synthetic data with Laplace-distributed noise around the red line.
The fit obtained from the (biased) MLE is in this case closer to the true generating line on most of the interval
than the unbiased OLS suggested by Gauss–Markov.
This is what would usually happen in such an experiment.

![](/assets/images/linreg_ols_vs_mle.png "OLS vs MLE for Laplace distributed errors")

(_Fun fact:_ If you want a general $l^p$ norm, there is the [generalized normal distribution](https://en.wikipedia.org/wiki/Generalized_normal_distribution).
The limit case for $p \to \infty$ with the max norm corresponds to a uniform distribution of errors.)

The takeaway here should be that if we aren't satisfied with the handwavy reference to the CLT,
and we have a reason to believe the distribution of the data is other than normal,
we can likely do better than OLS.
Note that the Laplace distribution has heavier tails than the normal,
so LAD may be practical to use in the presence of outliers in the data, as it's more robust.

## Generalized linear models
With the Laplace distribution instead of the normal, we have touched on the idea of using a different noise distribution.
[Generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLM) take this one step further.
First, consider the case with normally distributed noise and make a few simple observations.
Putting together $\eqref{eq:linreg}$ and $\eqref{eq:normalerr}$ can be written as:

$$\begin{equation}\label{eq:normalerr2} y \sim \mathrm{Normal}(X\beta, \sigma \mathbf{1}) \end{equation}$$

and thus also trivially:

$$ \mathrm{E}\left[y|X\right] = X\beta.$$

Now we want to generalize this to another distribution $\mathcal{D}(\theta, \theta'), \theta \in \Theta$.
Basically, we want to plug $X\beta$ into the distribution's parameter which represents the mean value.
A lot of distributions don't have such a parameter, though, and may not even allow negative values,
so we allow an invertible reparametrization function $g: \Theta \to \mathbb{R}$, called the _link function_, and require:

$$\begin{equation}\label{eq:link1}y \sim \mathcal{D}(g^{-1}(X\beta), \theta') \end{equation}$$

and

$$\begin{equation}\label{eq:link2} \mathrm{E}\left[y|X\right] = g^{-1}(X\beta). \end{equation}$$

(Don't ask me why the inverse link is used, it's just a weird convention.)
The choice of the link function is somewhat arbitrary as long as its domain is the parameter space,
but it does have an effect on how the model performs and how it should be interpreted.
The treatment of the extra parameters $\theta'$ may vary.
In the simplest cases they don't exist, or they're set to a fixed known value.
Otherwise, they may be treated as nuisance parameters and approached with profile likelihood (or marginalization in a Bayesian setting),
or they may be fully fledged parameters estimated together with $\beta$, which then may make the estimation more difficult.
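
A concrete instance of $\eqref{eq:link1}$ (the one used in the data example below) is the Poisson distribution with the log link $g(\mu) = \log\mu$:

$$ y \sim \mathrm{Poisson}\left(e^{X\beta}\right), \qquad \mathrm{E}\left[y|X\right] = e^{X\beta}, $$

and here there is no extra parameter $\theta'$ to worry about.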

When the distribution $\mathcal{D}$ is from the [exponential family](https://en.wikipedia.org/wiki/Exponential_family)
and $\eqref{eq:link1},\eqref{eq:link2}$ hold, there are good-performance algorithms available to compute the likelihood and to maximize it,
which are implemented in numerous libraries.

A famous instance of a GLM is logistic regression, where $\mathcal{D}(\theta), \theta \in (0,1)$ is the Bernoulli distribution,
and we use the logit function as the link. So in terms of $\eqref{eq:link1}$,
logistic regression can be characterized as the probabilistic model:

$$ y \sim \mathrm{Bernoulli}\left(\frac{1}{1+e^{-X\beta}}\right) $$
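
As a sketch of what this looks like in practice (assuming statsmodels; the binary-outcome data are made up for the illustration), the same model can be fitted either through the generic GLM interface or with the dedicated logistic-regression class:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Made-up binary-outcome data.
n = 500
x = rng.normal(size=n)
X = sm.add_constant(x)
p = 1.0 / (1.0 + np.exp(-(0.3 + 1.2 * x)))   # true success probability
y = rng.binomial(1, p)

# Bernoulli outcomes are handled by the Binomial family; logit is its default link.
glm_fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
logit_fit = sm.Logit(y, X).fit()             # the dedicated implementation, same model

print(glm_fit.params)
print(logit_fit.params)                      # the two estimates should agree
```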

When $y$ is binary, the Bernoulli distribution is the only option but, as I mentioned,
we can choose a different link function.
In this case, the quantile function of the normal distribution is another option
sometimes used as the link function, leading to the somewhat less popular [probit model](https://en.wikipedia.org/wiki/Probit_model).

For discrete count data $y$, we can use distributions like Poisson, binomial, geometric or negative binomial,
which are all in the exponential family and are thus GLM-compatible.

## GLM - example with data
The following plot shows a comparison of using the Poisson GLM with the log link versus OLS (a.k.a. the Normal-identity GLM)
on some daily purchase counts.
Although it may not be immediately obvious, the data do have some sort of trend growth and weekly periodicity.

For both GLMs, there are the same three features: an intercept, the time (horizontal axis),
and a boolean indicator for weekends.
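
A sketch of how such a fit could be set up (assuming statsmodels and pandas; the data frame, its column names and the synthetic counts are all made up, since the real purchase data isn't reproduced here):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)

# Hypothetical daily purchase counts with a mild trend and a weekend effect.
dates = pd.date_range("2024-01-01", periods=180, freq="D")
t = np.arange(len(dates))
is_weekend = (dates.dayofweek >= 5).astype(float)
true_mean = np.exp(-0.5 + 0.004 * t + 0.6 * is_weekend)
df = pd.DataFrame({
    "purchases": rng.poisson(true_mean),
    "t": t,
    "is_weekend": is_weekend,
})

X = sm.add_constant(df[["t", "is_weekend"]])  # intercept, time, weekend indicator
y = df["purchases"]

poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()   # log link is the default
normal_fit = sm.GLM(y, X, family=sm.families.Gaussian()).fit()   # identity link, i.e. OLS

print(poisson_fit.params)
print(normal_fit.params)
```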

![](/assets/images/normal_vs_poisson_glm.png "Normal vs Poisson GLM")

A glaring (qualitative) issue with the OLS fit is that it can go into negative numbers (and does, near the axes origin),
which would need to be solved using some ad-hoc technique like clipping, which feels pretty arbitrary.
The Poisson-log model will always produce a positive number.
But it has its own (quantitative) issue: its simplicity comes at the cost of an exponential trend growth,
which is unrealistic to use for extrapolation beyond the very short term.
In this example, the Poisson-log model has marginally better quality metrics like RMSE and MAE.

Where the qualitative difference between the models becomes more obvious is when we look at the confidence intervals.
The following plot shows the intervals in which 90% of the most likely outcomes will lie
(exactly 90% for OLS, at least 90% for the Poisson GLM due to its discrete nature).

![](/assets/images/normal_vs_poisson_glm_intervals.png "Normal vs Poisson GLM")
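
One way such central intervals could be computed, sketched with made-up predicted values (the real predictions aren't reproduced here, and the intervals in the plot may well have been obtained differently, e.g. accounting for parameter uncertainty):

```python
from scipy import stats

# Suppose the two fitted models predict these for tomorrow (made-up numbers):
mu_norm, sigma_hat = 1.2, 1.9   # normal-identity GLM: predicted mean and residual std
mu_pois = 1.1                   # Poisson-log GLM: predicted mean

# Central 90% interval for the normal model; it can easily go negative.
norm_lo, norm_hi = stats.norm.interval(0.90, loc=mu_norm, scale=sigma_hat)

# Central interval for the Poisson model: discrete, so coverage is at least 90%,
# and the bounds are always non-negative integers.
pois_lo = stats.poisson.ppf(0.05, mu_pois)
pois_hi = stats.poisson.ppf(0.95, mu_pois)

print(f"normal:  [{norm_lo:.2f}, {norm_hi:.2f}]")
print(f"poisson: [{pois_lo:.0f}, {pois_hi:.0f}]")
```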

Suppose a business person asks you how many items he is going to sell tomorrow.
Do you think he'd prefer the answer "between 0 and 3" or "between minus 1.9 and 4.33"?