added discussion on intervals
Jachym.Barvinek committed Dec 11, 2024
1 parent 319dcb7 commit 012e7ef
Showing 5 changed files with 23 additions and 423 deletions.
2 changes: 1 addition & 1 deletion _posts/2024-10-16-waiting-times.md
@@ -2,7 +2,7 @@
title: 'Waiting times'
layout: post
date: 2024-10-16
-tags: [probability theory,time series]
+tags: [probabilistic modelling,time series,event series]
pin: true
math: true
---
2 changes: 1 addition & 1 deletion _posts/2024-10-17-counting-events.md
@@ -2,7 +2,7 @@
title: 'Counting events'
layout: post
date: 2024-10-17
-tags: [probability theory,time series]
+tags: [probabilistic modelling,time series,event series]
pin: true
math: true
---
12 changes: 6 additions & 6 deletions _posts/2024-12-04-inhomogeneous-poisson-process.md
@@ -2,7 +2,7 @@
title: 'Inhomogeneous Poisson process & case study'
layout: post
date: 2024-12-04
-tags: [probability theory,statistics,time series,case study]
+tags: [probabilistic modelling,statistics,time series,case study,event series]
pin: true
math: true
---
@@ -43,7 +43,7 @@ Remember that the rate is also the expectation value, which is useful for predic
## Parameter estimation

From the statistical perspective, how do we fit this $\lambda(t)$ from data parametrically?
-We first need to assume the function $\lambda(t|\mathbf{\theta})$ is given by a certain formula,
+We first need to assume the function $\lambda(t\,|\,\mathbf{\theta})$ is given by a certain formula,
where we want to find the parameters $\mathbf{\theta}$.
The choice of this formula can be seen as a specification of further assumptions about the system,
and it should usually follow some exploratory analysis of the data.
@@ -55,7 +55,7 @@ Fortunately, even one realization may be enough to reasonably learn,
if there are enough event records compared to the number of parameters.
The log-likelihood can be shown<sup>[1],[2]</sup> to be:

-$$\mathscr{l}(\mathbf{t}|\mathbf{\theta}) = \sum_{i=1}^n \log \lambda(t_i|\mathbf{\theta}) - \int_0^T \lambda(s|\mathbf{\theta}) \mathrm{d}s$$
+$$\mathscr{l}(\mathbf{t}\,|\,\mathbf{\theta}) = \sum_{i=1}^n \log \lambda(t_i\,|\,\mathbf{\theta}) - \int_0^T \lambda(s\,|\,\mathbf{\theta}) \mathrm{d}s$$

Intuitively, the sum accounts for the events that have occurred and the integral accounts for the events that did not occur, but could have.
For more complicated functions, the integral will likely not have an analytic solution
@@ -69,7 +69,7 @@ It may be even clearer when we look at the aggregated hourly counts.

A simple three-parameter elementary function for this case may look like:

-$$\lambda(t|\alpha,\varphi,\sigma) = \alpha \cdot \exp^{\sigma}(\cos(t-\varphi)-1)$$
+$$\lambda(t\,|\,\alpha,\varphi,\sigma) = \alpha \cdot \exp\left(\sigma\left(\cos(t-\varphi)-1\right)\right)$$

where $\sigma \in \mathbb{R}^+$ controls the "width" of the bump, $\alpha \in \mathbb{R}^+$ is the amplitude,
and $\varphi \in (-\pi,\pi)$ is the phase shift.
@@ -83,12 +83,12 @@ The definite integral here, assuming additionally that $T = 2\pi k, k \in \mathb
does not depend on $\varphi$ and
can be expressed in terms of the modified Bessel function of the first kind ([proof](https://www.smbc-comics.com/comic/2013-01-20)):

-$$\int_0^T \lambda(s|\alpha,\varphi,\sigma) \mathrm{d}s = T \cdot \alpha \cdot e^{-\sigma} \cdot I_0(\sigma)$$
+$$\int_0^T \lambda(s\,|\,\alpha,\varphi,\sigma) \mathrm{d}s = T \cdot \alpha \cdot e^{-\sigma} \cdot I_0(\sigma)$$

Although this is not an elementary function, it is well established and implemented in various libraries where you can also do the optimization.
And just to be clear, the sum in the log-likelihood in this case simplifies (trivially) to:

-$$\sum_{i=1}^n \log \lambda(t_i|\alpha,\varphi,\sigma) = n(\log(\alpha)-\sigma) + \sigma \sum_{i=1}^n \cos(t_i-\varphi)$$
+$$\sum_{i=1}^n \log \lambda(t_i\,|\,\alpha,\varphi,\sigma) = n(\log(\alpha)-\sigma) + \sigma \sum_{i=1}^n \cos(t_i-\varphi)$$

Now we can find the parameter values using our favorite numerical optimizer (see Appendix).
The results are shown in the following plots.
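
For illustration, here is a minimal sketch of such a fit in Python (not the Appendix code; it assumes NumPy/SciPy, event times already rescaled to radians so that one period is $2\pi$, and a placeholder data array):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import i0  # modified Bessel function of the first kind, order 0

def neg_log_likelihood(params, t, T):
    """Negative log-likelihood of lambda(t) = alpha * exp(sigma * (cos(t - phi) - 1)),
    using the closed-form integral T * alpha * exp(-sigma) * I_0(sigma)."""
    log_alpha, phi, log_sigma = params            # log-parametrize alpha, sigma to keep them positive
    alpha, sigma = np.exp(log_alpha), np.exp(log_sigma)
    event_term = t.size * (np.log(alpha) - sigma) + sigma * np.sum(np.cos(t - phi))
    integral_term = T * alpha * np.exp(-sigma) * i0(sigma)
    return -(event_term - integral_term)

# Placeholder event times in radians; substitute the real event records here.
t = np.array([0.3, 0.9, 1.1, 1.4, 2.0, 5.9, 6.1])
T = 2 * np.pi                                     # one full period of observation (k = 1)

res = minimize(neg_log_likelihood, x0=np.zeros(3), args=(t, T), method="Nelder-Mead")
log_alpha_hat, phi_hat, log_sigma_hat = res.x
alpha_hat, sigma_hat = np.exp(log_alpha_hat), np.exp(log_sigma_hat)
print(f"alpha = {alpha_hat:.3f}, phi = {phi_hat:.3f}, sigma = {sigma_hat:.3f}")
```

Log-parametrizing $\alpha$ and $\sigma$ keeps them positive without needing a constrained optimizer.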
23 changes: 10 additions & 13 deletions _posts/2024-12-07-glms.md
@@ -2,7 +2,7 @@
title: 'From linear regression to GLMs – theory and examples'
layout: post
date: 2024-12-07
-tags: [statistics,time series,regression,case study]
+tags: [statistics,time series,regression,case study,probabilistic modelling]
pin: true
math: true
excerpt_separator: '<!--more-->'
@@ -25,13 +25,9 @@ For small counts, the distinction between normal and a discrete counting distrib
Generalized linear models (GLMs) may fix this and similar issues by allowing other kinds of data distributions
while keeping a lot of the desirable and familiar properties of linear regression.

-## Linear regression
-<span title="who I expect to read this blog">Everybody<span>
-knows good-old linear regression.
-But it's a good point where to start with the topic, so let's recap a bit.
-First, let's study the relation of linear regression and ordinary least squares (OLS).

-### Gauss–Markov theorem
+## Gauss–Markov theorem
OLS is the most basic and most common method for making a linear fit,
and it is usually justified by referring to the [Gauss–Markov theorem](https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem),
which says that ordinary least squares is the best linear unbiased estimator under certain conditions.
@@ -62,7 +58,7 @@ This is quite powerful, because it doesn't specify which distribution $\varepsil

But what if we decided we want to maximize efficiency, perhaps even at the cost of a biased estimator?

-### MLE for linear regression
+## MLE for linear regression

To do MLE, we need a distribution.
So let's now specify what distribution the error terms $\varepsilon$ have.
@@ -74,15 +70,16 @@ where $\sigma \in \mathbb{R}^+$ is constant.
The assumptions of the Gauss–Markov theorem are satisfied.
Coincidentally, it <span title="Cool exercise for univariate case">can be shown</span>
that the maximum likelihood estimate of $\beta$ is in this case again OLS.
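To sketch why (keeping $\sigma$ fixed): up to an additive constant, the log-likelihood of the normal model is

$$\mathscr{l}(X,y\,|\,\beta,\sigma) = -\frac{1}{2\sigma^2}\lVert X\beta - y \rVert_2^2 + \mathrm{const},$$

so maximizing over $\beta$ is exactly minimizing $\lVert X\beta - y \rVert_2^2$, which is OLS.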
This may seem unsurprising, but let's have a look at another example.

-But what if the errors have a different distribution? For example [Laplace](https://en.wikipedia.org/wiki/Laplace_distribution).
+What if the errors have a different distribution? For example [Laplace](https://en.wikipedia.org/wiki/Laplace_distribution) (a.k.a. double exponential).

$$ \varepsilon_i \sim \mathrm{Laplace}\left(0, \; \theta'\right)$$

Now the assumptions of the Gauss–Markov theorem are still satisfied, but the MLE is least absolute deviations (LAD);
in other words, we use the $l^1$ norm instead of $l^2$:

-$$\hat{\beta} = \arg\min_{\beta}\mathscr{l}(X,y|\beta) = \arg\min_{\beta} \lVert X\beta - y \rVert_1 = \arg\min_{\beta} \sum_i | X_i \beta - y_i|$$
+$$\hat{\beta} = \arg\max_{\beta}\mathscr{l}(X,y\,|\,\beta) = \arg\min_{\beta} \lVert X\beta - y \rVert_1 = \arg\min_{\beta} \sum_i | X_i \beta - y_i|$$

Under usual circumstances, [MLE estimates are statistically efficient](https://gregorygundersen.com/blog/2019/11/28/asymptotic-normality-mle/), at least asymptotically.
The following plot shows some synthetic data with Laplace-distributed noise around the red line.
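
A rough sketch of such a comparison on synthetic data (assuming NumPy/SciPy; illustrative values only, not necessarily the code behind the plot):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic data: a line with intercept 1 and slope 2, plus Laplace noise
n = 200
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])                 # design matrix with an intercept column
y = X @ np.array([1.0, 2.0]) + rng.laplace(scale=1.5, size=n)

# OLS fit (closed form via least squares)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# LAD fit = MLE under Laplace errors: minimize the l1 norm of the residuals
beta_lad = minimize(lambda b: np.sum(np.abs(X @ b - y)), x0=beta_ols, method="Nelder-Mead").x

print("OLS:", beta_ols)
print("LAD:", beta_lad)
```

With heavier tails or a few outliers in the noise, the LAD fit is typically disturbed less than OLS.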
@@ -113,7 +110,7 @@ $$\begin{equation}\label{eq:normalerr2} y \sim \mathrm{Normal}\left(X\beta, \; \

and thus also trivially:

-$$ \mathrm{E}\left[y|X\right] = X\beta.$$
+$$ \mathrm{E}\left[y\,|\,X\right] = X\beta.$$

Now we want to generalize this to another distribution $\mathcal{D}\left(\theta, \theta'\right), \theta \in \Theta$.
Basically we want to plug $X\beta$ into the distribution's parameter that represents the mean value.
@@ -124,7 +121,7 @@ $$\begin{equation}\label{eq:link1}y \sim \mathcal{D}\left(g^{-1}\left(X\beta\rig

and

-$$\begin{equation}\label{eq:link2} \mathrm{E}\left[y|X\right] = g^{-1}(X\beta). \end{equation}$$
+$$\begin{equation}\label{eq:link2} \mathrm{E}\left[y\,|\,X\right] = g^{-1}(X\beta). \end{equation}$$
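
As a concrete instance of this notation (one standard choice, not the only one): take $\mathcal{D} = \mathrm{Poisson}$ with the log link $g = \log$, so that

$$y \sim \mathrm{Poisson}\left(e^{X\beta}\right), \qquad \mathrm{E}\left[y\,|\,X\right] = e^{X\beta},$$

and the linear predictor $X\beta$ can range over all of $\mathbb{R}$ while the mean stays positive.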

(Don't ask me why the inverse link is used in most places; it's just a weird convention.)
The choice of the link function is kind of arbitrary as long as its domain is the parameter space,
@@ -212,7 +209,7 @@ The prediction interval asks "What is the range of a future observed value $\hat
Here $X$ is the past observed data matrix and $x \in \mathbb{R}^k$ is a new vector of features for which we're making a prediction.
To answer this we will identify the distribution of $\hat{y} \in \mathbb{R}$. In this case, we have:

-$$y|x,\hat{\beta},\hat{\sigma} \sim \mathrm{Normal}(x^T\hat{\beta},\; \hat{\sigma}^2)$$
+$$y\,|\,x,\hat{\beta},\hat{\sigma} \sim \mathrm{Normal}(x^T\hat{\beta},\; \hat{\sigma}^2)$$

where we additionally need the MLE of $\sigma^2$, namely $\hat{\sigma}^2$,
which is given by the usual population variance formula:
@@ -255,7 +252,7 @@ $$\lim_{n\to\infty} \hat{y} \sim \mathcal{D}\left(g^{-1}\left(x^T\beta\right), \
So for a confidence level $\gamma \in (0, 1)$, the approximate prediction interval is the smallest set $C$
such that

-$$\int_C p_{\mathcal{D}}\left(y\;|\;g^{-1}\left(x^T\beta\right),\; \theta'\right) \, \mathrm{d}y \le \gamma$$
+$$\int_C p_{\mathcal{D}}\left(y\,|\,g^{-1}\left(x^T\beta\right),\; \theta'\right) \, \mathrm{d}y \ge \gamma$$

and analogously a sum for a discrete distribution.
Obviously, in practice, we must substitute the estimated values in place of $\beta, \theta'$.
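
To make the recipe tangible, here is a rough end-to-end sketch for a Poisson GLM with a log link (synthetic data and SciPy, not the post's own code; it uses an equal-tailed interval, which may differ slightly from the smallest set $C$):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

def neg_log_likelihood(beta, X, y):
    """Poisson GLM with log link: y_i ~ Poisson(exp(x_i^T beta)); constant terms dropped."""
    eta = X @ beta                                   # linear predictor
    return np.sum(np.exp(eta) - y * eta)

# Synthetic data (illustrative): one feature plus an intercept, true beta = (0.5, 1.0)
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.uniform(0, 2, 200)])
y = rng.poisson(np.exp(X @ np.array([0.5, 1.0])))

beta_hat = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1]), args=(X, y)).x

# Approximate prediction interval at a new feature vector x, confidence level gamma
x_new = np.array([1.0, 1.5])
mu_hat = np.exp(x_new @ beta_hat)                    # estimated mean g^{-1}(x^T beta_hat)
gamma = 0.9
lo, hi = poisson.interval(gamma, mu_hat)             # equal-tailed interval of the fitted Poisson
print(f"predicted mean {mu_hat:.2f}, ~{gamma:.0%} prediction interval [{lo:.0f}, {hi:.0f}]")
```

Note that the uncertainty of $\hat{\beta}$ itself is ignored here, which is exactly the asymptotic approximation discussed above.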
(The diff of the fifth changed file did not load.)
