Commit

glms
Jachym.Barvinek committed Dec 8, 2024
1 parent ee5497f commit e764864
Showing 8 changed files with 1,734 additions and 89 deletions.
4 changes: 2 additions & 2 deletions _config.yml
@@ -16,7 +16,7 @@ timezone: "Europe/Prague"

title: Jáchym Barvínek # the main title

tagline: tech blog # it will display as the sub-title
tagline: data science tech blog # it will display as the sub-title

description: # used by seo meta and the atom feed
# A minimal, responsive and feature-rich Jekyll theme for technical writing.
@@ -86,7 +86,7 @@ pageviews:
# light - Use the light color scheme
# dark - Use the dark color scheme
#
theme_mode: # [light | dark]
theme_mode: #[light | dark]

# The CDN endpoint for media resources.
# Notice that once it is assigned, the CDN url
48 changes: 32 additions & 16 deletions _posts/2024-12-04-inhomogeneous-poisson-process.md
@@ -1,8 +1,8 @@
---
title: 'Inhomogeneous Poisson process modelling - case study'
title: 'Inhomogeneous Poisson process & case study'
layout: post
date: 2024-12-04
tags: [probability theory]
tags: [probability theory,statistics,case study]
pin: true
math: true
---
@@ -43,22 +43,23 @@ Remember that the rate is also the expectation value, which is useful for predic
## Parameter estimation

From the statistical perspective, how do we fit this $\lambda(t)$ from data parametrically?
We first need to assume the function $\lambda(t|\theta)$ is given by a certain formula,
where we want to find the parameters $\theta$.
We first need to assume the function $\lambda(t|\mathbf{\theta})$ is given by a certain formula,
where we want to find the parameters $\mathbf{\theta}$.
The choice of this formula can be seen as a specification of further assumptions about the system,
and it should usually follow some exploratory analysis of the data.

As usual with parameter learning, we use maximum likelihood estimation. But what is the likelihood function here?
We should note that one sample is not a single event occurrence $t_i$ (since these are _not_ identically distributed),
We should note that one sample is not a single event occurrence $t_i$ (since their differences are _not_ identically distributed),
but rather, the interval $(0, T)$ with the entire realization $\mathbf{t} = (t_1, ..., t_n)$ where each $t_i \in I$.
Fortunately, even one realization may be enough to reasonably learn,
if there are enough event records compared to the number of parameters.
The log-likelihood can be shown<sup>[1],[2]</sup> to be:

$$\mathcal{l}(\mathbf{t}|\theta) = \sum_{i=1}^n \log \lambda(t_i|\theta) - \int_0^T \lambda(s|\theta) \mathrm{d}s$$
$$\mathcal{l}(\mathbf{t}|\mathbf{\theta}) = \sum_{i=1}^n \log \lambda(t_i|\mathbf{\theta}) - \int_0^T \lambda(s|\mathbf{\theta}) \mathrm{d}s$$

Intuitively, the sum accounts for the events that occurred and the integral accounts for the events that did not occur, but could have.
For more complicated functions, the integral will likely not have an analytic solution and we the need to resort to numerical methods to evaluate the likelihood.
Intuitively, the sum accounts for the events that have occurred and the integral accounts for the events that did not occur, but could have.
For more complicated functions, the integral will likely not have an analytic solution
and we then need to resort to numerical methods to evaluate the likelihood, or use sampling algorithms to find the parameters.
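
For illustration, a minimal Wolfram Language sketch of such a numerically evaluated log-likelihood could look like this (not the code used later in this post; the trend-shaped rate and its parameter bounds are made up, and `events` and `T` are assumed to hold the observed event times and the length of the observation window):

```wolfram
(* Sketch: log-likelihood of an inhomogeneous Poisson process for a generic
   rate function, with the integral evaluated numerically by NIntegrate.
   Assumes `events` is the list of observed event times in (0, T)
   and `T` is a known numeric constant. *)
logLike[rateFun_, events_, T_] :=
  Total[Log[rateFun /@ events]] - NIntegrate[rateFun[u], {u, 0, T}]

(* Hypothetical example: a rate with an exponential trend in parameters a, b.
   The ?NumericQ guards keep NIntegrate from being evaluated symbolically. *)
obj[a_?NumericQ, b_?NumericQ] := logLike[Function[t, Exp[a + b t]], events, T]
NMaximize[{obj[a, b], -10 < a < 10 && -1 < b < 1}, {a, b}]
```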

## Modelling – case study
Looking briefly at the previous plot, I mentioned that there is some sort of periodicity visible.
@@ -73,20 +73,23 @@ $$\lambda(t|\alpha,\varphi,\sigma) = \alpha \cdot \exp^{\sigma}(\cos(t-\varphi)-
Where $\sigma \in \mathbb{R}^+$ controls the "width" of the bump, $\alpha \in \mathbb{R}^+$ is the amplitude
and $\varphi \in (-\pi,\pi)$ is the phase shift.
Of course, we assume the data have been scaled so that the length of one day is $2\pi$.
The exponential is here to guarantee rate positivity and is somewhat convenient to work with.
The exponential is here to guarantee rate positivity and is somewhat convenient to work with; the $-1$ next to the $\cos$ isn't strictly necessary, but it makes $\alpha$ directly interpretable as the peak height.
This function is inspired by the [von Mises distribution](https://en.wikipedia.org/wiki/Von_Mises_distribution) density for circular data.
I am not arguing this is the best choice here, just that it's _a choice_ that
we can reasonably expect to give better predictive results for this particular data than just using the homogeneous process model.

The definite integral here can be expressed in terms of the modified Bessel function of the first kind ([proof](https://www.smbc-comics.com/comic/2013-01-20)):
The definite integral here, assuming additionally that $T = 2\pi k, k \in \mathbb{N}$ (i.e. we are analyzing a whole number $k$ of days),
does not depend on $\varphi$ and
can be expressed in terms of the modified Bessel function of the first kind ([proof](https://www.smbc-comics.com/comic/2013-01-20)):

$$\int_0^T \lambda(s|\alpha,\varphi,\sigma) \mathrm{d}s = T \cdot \alpha \cdot e^{-\sigma} I_0(\sigma)$$
$$\int_0^T \lambda(s|\alpha,\varphi,\sigma) \mathrm{d}s = T \cdot \alpha \cdot e^{-\sigma} \cdot I_0(\sigma)$$

Although this function is not elementary, it is well-established and implemented in various libraries where you can also do optimization.
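
As a quick numerical sanity check (just a sketch with arbitrary example parameter values; the rate is written out as `alpha*Exp[sigma*(Cos[t - phi] - 1)]`, matching the log-likelihood in the Appendix):

```wolfram
(* Sketch: compare NIntegrate of the rate with the Bessel-function formula.
   The parameter values below are arbitrary examples; T covers 7 whole days. *)
With[{alpha = 2.0, phi = 0.7, sigma = 1.5, T = 2*Pi*7},
 {NIntegrate[alpha*Exp[sigma*(Cos[t - phi] - 1)], {t, 0, T}],
  T*alpha*Exp[-sigma]*BesselI[0, sigma]}]
(* The two numbers should agree up to numerical error. *)
```
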
And just to be clear, the sum in the log-likelihood in this case simplifies (trivially) to:

$$\sum_{i=1}^n \log \lambda(t_i|\alpha,\varphi,\sigma) = n(\log(\alpha)-\sigma) + \sigma \sum_{i=1}^n \cos(t_i-\varphi)$$

Now we can find the parameter values using our favorite optimizer (see Appendix).
Now we can find the parameter values using our favorite numerical optimizer (see Appendix).
The results are shown on the following plots.
Visually, it seems not great, not terrible?
But I think this is decent considering I'm only using three parameters.
Expand All @@ -112,28 +116,38 @@ where we can use them to express arbitrary periodicities, which provides a natur
## Appendix – Code
In Wolfram Script, the optimization takes only a short piece of code:
```wolfram
n = Length[data];
loglike = sigma*Total[Cos[data - phi]] + (Log[alpha]-sigma)*n - alpha*T*Exp[-sigma]*BesselI[0, sigma];
T = 2*k*Pi;
loglike = (
sigma*Total[Cos[data - phi]]
+ (Log[alpha]-sigma)*Length[data]
- alpha*T*Exp[-sigma]*BesselI[0, sigma]
);
constraints = sigma > 0 && alpha > 0 && -Pi < phi <= Pi;
NMaximize[{loglike, constraints}, {alpha, sigma, phi}]
```

But I like Stan probabilistic modelling: even though the code is more verbose, it can be more naturally integrated with Bayesian techniques.

```stan
functions {
real my_process_lpdf(data vector t, data real T, real alpha, real phi, real sigma) {
return sigma*sum(cos(t-phi)) + (log(alpha)-sigma)*size(t)
return sigma*sum(cos(t-phi))
+ (log(alpha)-sigma)*size(t)
- alpha*T*exp(-sigma)*modified_bessel_first_kind(0, sigma);
}
}
data {
int<lower=0> N;
real<lower=0> T;
int<lower=0> k;
vector<lower=0>[N] t; // note: T now lives in transformed data, so it can no longer be used as an upper-bound constraint here
}
transformed data {
real T = 2*k*pi();
}
parameters {
real<lower=0> alpha;
real<lower=0> sigma;
@@ -144,3 +158,5 @@ model {
t ~ my_process(T, alpha, phi, sigma);
}
```

[Notebook for plots.](/assets/notebooks/imhomogeneous_process.ipynb)
175 changes: 175 additions & 0 deletions _posts/2024-12-07-glms.md
@@ -0,0 +1,175 @@
---
title: 'Generalized linear models – theory and data examples'
layout: post
date: 2024-12-07
tags: [probability theory,statistics,case study]
pin: true
math: true
---

This article is a sort of introduction to generalized linear models,
explaining the basic theory of how exactly they relate to linear regression
and illustrating why they may be useful on some data examples.

One important case where linear regression fails when modelling time series is when the data are positive integers close to zero
(e.g. daily counts of purchases of a low-selling product).
This is because there are certain assumptions encoded in the model
that we often justify with some handwavy reference
to the central limit theorem.
But in this case that assumption may easily lead to poor results:
for small counts, the distinction between a normal and a discrete counting distribution will be significant.
(See the article about [counting events](/posts/counting-events) for a more detailed discussion.)

Generalized linear models (GLMs) may fix this and similar issues by allowing other kinds of data distributions
while keeping a lot of the desirable and familiar properties of linear regression.

## Linear regression
<span title="who I expect to read this blog">Everybody</span>
knows good-old linear regression.
But it's a good place to start with the topic, so let's recap a bit.
First, let's study the relation between linear regression and ordinary least squares (OLS).

### Gauss–Markov theorem
OLS is the most basic and most common method for making a linear fit
and is usually justified by referring to the [Gauss–Markov theorem](https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem),
which says that ordinary least squares is the best linear unbiased estimator under certain conditions.

In particular:
We call an estimator unbiased when $\mathrm{E}[\hat{\beta}] = \beta$.
We say it's best in the sense of [statistical efficiency](https://en.wikipedia.org/wiki/Efficiency_(statistics)),
i.e. we want to minimize $\mathrm{Var}[\hat{\beta}]$.

Suppose we have

$$\begin{equation}\label{eq:linreg}y = X\beta + \varepsilon,\end{equation}$$

where $y,\varepsilon \in \mathbb{R}^n, \beta \in \mathbb{R}^k, X \in \mathbb{R}^{n\times k}$
and individual $\varepsilon_i$ are understood as random variables.
Since $y$ depends on $\varepsilon$, it can also be analyzed as a random variable, and so can any estimate $\hat{\beta}$ which depends on $y$.
Assume the error terms $\varepsilon$ have zero mean, i.e. $\mathrm{E}[\varepsilon_i] = 0$,
and that they all have the same variance and are uncorrelated, i.e. $\mathrm{Cov}[\varepsilon_i, \varepsilon_j] = \sigma^2 \delta_{ij}$ where $\sigma \in \mathbb{R}^+$
(here using the Kronecker delta symbol).
The Gauss–Markov theorem says that under these conditions,
the OLS estimator, given by

$$\hat{\beta} = \arg\min_{\beta} \lVert X\beta - y \rVert_2^2 = \left(X^T X\right)^{-1} X^T y$$

is the best linear estimator among those which are unbiased.

This is quite powerful, because it doesn't specify which distribution $\varepsilon$ has, only that it needs to have certain properties.
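
As a small concrete illustration (a sketch with made-up data; the coefficients, noise and sample size are arbitrary examples):

```wolfram
(* Sketch: the closed-form OLS estimator on synthetic data.  The true
   coefficients {2., 0.5} are arbitrary; the noise is uniform on (-1, 1),
   just to emphasize that Gauss-Markov does not require normality. *)
SeedRandom[42];
n = 100;
X = Transpose[{ConstantArray[1., n], RandomReal[{0, 10}, n]}]; (* intercept + one feature *)
y = X . {2., 0.5} + RandomReal[{-1, 1}, n];

Inverse[Transpose[X] . X] . Transpose[X] . y  (* (X^T X)^-1 X^T y *)
LeastSquares[X, y]                            (* the same estimate via the built-in solver *)
```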

But what if we decided we want to maximize efficiency, perhaps even at the cost of having the estimator biased?

### MLE for linear regression

Now, let's specify what distribution the error terms $\varepsilon$ have.
The most common choice is:

$$\begin{equation}\label{eq:normalerr} \varepsilon_i \sim \mathrm{Normal}(0, \sigma), \end{equation}$$

where $\sigma \in \mathbb{R}^+$ is constant.
The assumptions of the Gauss–Markov theorem are satisfied.
Coincidentally, it <span title="Cool exercise for univariate case">can be shown</span>
that the maximum likelihood estimate of $\beta$ is in this case again the OLS estimate.

But what if the errors have a different distribution? For example, the [Laplace distribution](https://en.wikipedia.org/wiki/Laplace_distribution):

$$ \varepsilon_i \sim \mathrm{Laplace}\left(0, \theta'\right)$$

Now the assumptions of the Gauss–Markov theorem are still satisfied, but the MLE is the least absolute deviations (LAD) estimator;
in other words, we use the $l^1$ norm instead of $l^2$:

$$\hat{\beta} = \arg\min_{\beta} \lVert X\beta - y \rVert_1 = \arg\min_{\beta} \sum_i | X_i \beta - y_i|$$

Under usual circumstances, [MLE estimates are statistically efficient](https://gregorygundersen.com/blog/2019/11/28/asymptotic-normality-mle/), at least asymptotically.
The following plot shows some synthetic data with Laplace-distributed noise around the red line.
The fit obtained from the (biased) MLE is in this case closer to the true generating line on most of the interval
than the unbiased OLS fit suggested by Gauss–Markov.
This is what would usually happen in such an experiment.

![](/assets/images/linreg_ols_vs_mle.png "OLS vs MLE for Laplace distributed errors")
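
A sketch of this kind of experiment (not the exact notebook behind the figure; the line $1 + 0.8x$, the Laplace scale and the sample size are made-up values) might look like:

```wolfram
(* Sketch: OLS vs. least absolute deviations (the Laplace MLE) on synthetic
   data with Laplace noise.  All parameter values are arbitrary examples. *)
SeedRandom[7];
n = 200;
x = RandomReal[{0, 10}, n];
y = 1. + 0.8 x + RandomVariate[LaplaceDistribution[0, 2], n];

ols = LinearModelFit[Transpose[{x, y}], t, t]["BestFitParameters"]
lad = NMinimize[Total[Abs[b0 + b1 x - y]], {b0, b1}] (* l1 objective, nonsmooth but fine for NMinimize *)
```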

(_Fun fact:_ If you want a general $l^p$ norm, there is the [Generalized normal distribution](https://en.wikipedia.org/wiki/Generalized_normal_distribution).
The limit case for $p \to \infty$ with the max norm corresponds to a uniform distribution of errors.)

The takeaway here should be that if we aren't satisfied with the handwavy reference to the CLT,
and we have a reason to believe the distribution of the data is other than normal,
we can likely do better than OLS.
Note that the Laplace distribution has heavier tails than the normal,
so LAD may be practical to use in the presence of outliers in the data, as it is more robust.

## Generalized linear models
With the Laplace distribution instead of the normal, we have touched on the idea of using a different noise distribution.
[Generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLM) take this one step further.
First, consider the case with normally distributed noise and make a few simple observations.
Putting together $\eqref{eq:linreg}$ and $\eqref{eq:normalerr}$, we can write:

$$\begin{equation}\label{eq:normalerr2} y \sim \mathrm{Normal}(X\beta, \sigma \mathbf{1}) \end{equation}$$

and thus also trivially:

$$ \mathrm{E}\left[y|X\right] = X\beta.$$

Now we want to generalize this to another distribution $\mathcal{D}(\theta, \theta'), \theta \in \Theta$.
Basically, we want to plug $X\beta$ into the distribution's parameter which represents the mean value.
However, many distributions don't have such a parameter and may not even allow negative values,
so we allow an invertible reparametrization function $g: \Theta \to \mathbb{R}$, called the _link function_, and require:

$$\begin{equation}\label{eq:link1}y \sim \mathcal{D}(g^{-1}(X\beta), \theta') \end{equation}$$

and

$$\begin{equation}\label{eq:link2} \mathrm{E}\left[y|X\right] = g^{-1}(X\beta). \end{equation}$$

(Don't ask me why the inverse link is used, it's just a weird convention.)
The choice of the link function is kind of arbitrary as long as its domain is the parameter space,
but it does have an effect on how the model performs and how it should be interpreted.
The treatment of the extra parameters $\theta'$ may vary.
In the simplest cases they don't exist or they're set to a fixed known value.
Otherwise, they may be treated as nuisance parameters and approached with profile likelihood (or marginalization in a Bayesian setting)
or they may be fully fledged parameters estimated together with $\beta$, which then may make the estimation more difficult.

When the distribution $\mathcal{D}$ is from the [exponential family](https://en.wikipedia.org/wiki/Exponential_family)
and $\eqref{eq:link1},\eqref{eq:link2}$ hold, there are well-performing algorithms available to compute the likelihood and to maximize it,
which are implemented in numerous libraries.

A famous instance of a GLM is logistic regression, where $\mathcal{D}(\theta), \theta \in (0,1)$ is the Bernoulli distribution,
and we use the logit function as the link. So in terms of $\eqref{eq:link1}$,
logistic regression can be characterized as the probabilistic model:

$$ y \sim \mathrm{Bernoulli}\left(\frac{1}{1+e^{-X\beta}}\right) $$

When $y$ is binary, the Bernoulli distribution is the only option, but as I mentioned,
we can choose a different link function.
In this case, the quantile function of the normal distribution is another possible option,
which is also sometimes used as the link function, leading to the somewhat less popular [probit model](https://en.wikipedia.org/wiki/Probit_model).
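
To make this concrete, here is a sketch with synthetic binary data (the data-generating coefficients, intercept $-1$ and slope $2$, are arbitrary example values):

```wolfram
(* Sketch: logistic and probit regression on synthetic binary data.
   The true coefficients (intercept -1, slope 2) are arbitrary examples. *)
SeedRandom[3];
n = 500;
x = RandomReal[{-3, 3}, n];
p = 1/(1 + Exp[-(-1 + 2 x)]);                        (* true success probabilities *)
y = Map[RandomVariate[BernoulliDistribution[#]] &, p];
data = Transpose[{x, y}];

LogitModelFit[data, {t}, t]["BestFitParameters"]     (* Bernoulli + logit link *)
ProbitModelFit[data, {t}, t]["BestFitParameters"]    (* same data, probit link *)
```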

For discrete count data $y$, we can use distributions like Poisson, binomial, geometric or negative binomial,
which are all in the exponential family and are thus GLM-compatible.

## GLM – example with data
The following plot shows a comparison of the Poisson GLM with a log link versus OLS (a.k.a. the Normal-identity GLM)
on some daily purchase-count data.
Although it may not be immediately obvious, the data do have some sort of trend growth and weekly periodicity.

Both GLMs use the same three features: an intercept, the time (horizontal axis),
and a boolean indicator for weekends.

![](/assets/images/normal_vs_poisson_glm.png "Normal vs Poisson GLM")
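
A sketch of how such fits can be set up (on synthetic data standing in for the real purchase counts; the trend, weekend effect and sample size are made-up values, so this is not the notebook behind the plot):

```wolfram
(* Sketch: Poisson GLM with log link vs. an ordinary linear model on synthetic
   daily counts with a mild trend and a weekend bump.  All coefficients are
   arbitrary example values. *)
SeedRandom[11];
days = Range[0, 199];
weekend = Map[Boole[# >= 5] &, Mod[days, 7]];          (* hypothetical weekend indicator *)
mu = Exp[0.3 + 0.005 days + 0.6 weekend];              (* true mean, linear on the log scale *)
y = Map[RandomVariate[PoissonDistribution[#]] &, mu];
data = Transpose[{days, weekend, y}];

poisson = GeneralizedLinearModelFit[data, {d, w}, {d, w},
   ExponentialFamily -> "Poisson"];                    (* canonical log link is the default *)
normal = LinearModelFit[data, {d, w}, {d, w}];         (* OLS, a.k.a. Normal-identity GLM *)
poisson["BestFitParameters"]
```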

A glaring (qualitative) issue with the OLS fit is that it can go into negative numbers (and it does near the origin),
which would need to be solved using some ad-hoc technique like clipping, which feels pretty arbitrary.
The Poisson-log model will always produce a positive number.
But it has its own (quantitative) issue: its simplicity comes at the cost of an exponential trend growth,
which is unrealistic to use for extrapolation beyond the very short term.
In this example, the Poisson-log model has marginally better quality metrics like RMSE and MAE.

Where the qualitative difference between the models becomes more obvious is when we look at the confidence intervals.
The following plot shows the intervals in which 90% of the most likely outcomes will lie
(exactly 90% for OLS, at least 90% for the Poisson GLM due to its discrete nature).

![](/assets/images/normal_vs_poisson_glm_intervals.png "Normal vs Poisson GLM")
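
For the Poisson model, one simple way to get such an interval is to take central quantiles of the predicted distribution. A sketch with made-up values for the predicted mean and the residual scale (the plot above may use a slightly different construction, e.g. the most likely outcomes rather than central quantiles):

```wolfram
(* Sketch: central ~90% intervals at a hypothetical prediction.
   muHat is a made-up predicted mean; sigmaHat a made-up residual std. deviation. *)
muHat = 1.2; sigmaHat = 1.9;
Quantile[PoissonDistribution[muHat], {0.05, 0.95}]        (* integer endpoints, never below 0 *)
Quantile[NormalDistribution[muHat, sigmaHat], {0.05, 0.95}] (* symmetric interval, can go negative *)
```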

Suppose a business person asks you how many items he is going to sell tomorrow.
Do you think he would prefer the answer "between 0 and 3" or "between minus 1.9 and 4.33"?

Binary file added assets/images/linreg_ols_vs_mle.png
Binary file added assets/images/normal_vs_poisson_glm.png