added discussion on intervals
Jachym.Barvinek committed Dec 11, 2024
1 parent 319dcb7 commit 012e7ef
Showing 5 changed files with 23 additions and 423 deletions.
2 changes: 1 addition & 1 deletion _posts/2024-10-16-waiting-times.md
@@ -2,7 +2,7 @@
title: 'Waiting times'
layout: post
date: 2024-10-16
-tags: [probability theory,time series]
+tags: [probabilistic modelling,time series,event series]
pin: true
math: true
---
2 changes: 1 addition & 1 deletion _posts/2024-10-17-counting-events.md
@@ -2,7 +2,7 @@
title: 'Counting events'
layout: post
date: 2024-10-17
-tags: [probability theory,time series]
+tags: [probabilistic modelling,time series,event series]
pin: true
math: true
---
12 changes: 6 additions & 6 deletions _posts/2024-12-04-inhomogeneous-poisson-process.md
@@ -2,7 +2,7 @@
title: 'Inhomogeneous Poisson process & case study'
layout: post
date: 2024-12-04
-tags: [probability theory,statistics,time series,case study]
+tags: [probabilistic modelling,statistics,time series,case study,event series]
pin: true
math: true
---
@@ -43,7 +43,7 @@ Remember that the rate is also the expectation value, which is useful for predic
## Parameter estimation

From the statistical perspective, how do we fit this $\lambda(t)$ from data parametrically?
-We first need to assume the function $\lambda(t|\mathbf{\theta})$ is given by a certain formula,
+We first need to assume the function $\lambda(t\,|\,\mathbf{\theta})$ is given by a certain formula,
where we want to find the parameters $\mathbf{\theta}$.
The choice of this formula can be seen as a specification of further assumptions about the system,
and it should usually follow some exploratory analysis of the data.
@@ -55,7 +55,7 @@ Fortunately, even one realization may be enough to reasonably learn,
if there are enough event records compared to the number of parameters.
The log-likelihood can be shown<sup>[1],[2]</sup> to be:

-$$\mathscr{l}(\mathbf{t}|\mathbf{\theta}) = \sum_{i=1}^n \log \lambda(t_i|\mathbf{\theta}) - \int_0^T \lambda(s|\mathbf{\theta}) \mathrm{d}s$$
+$$\mathscr{l}(\mathbf{t}\,|\,\mathbf{\theta}) = \sum_{i=1}^n \log \lambda(t_i\,|\,\mathbf{\theta}) - \int_0^T \lambda(s\,|\,\mathbf{\theta}) \mathrm{d}s$$

Intuitively, the sum accounts for the events that have occurred and the integral accounts for the events that did not occur, but could have.
For more complicated functions, the integral will likely not have an analytic solution
@@ -69,7 +69,7 @@ It may be even clearer when we look at the aggregated hourly counts.

A simple three-parameter elementary function for this case may look like:

-$$\lambda(t|\alpha,\varphi,\sigma) = \alpha \cdot \exp^{\sigma}(\cos(t-\varphi)-1)$$
+$$\lambda(t\,|\,\alpha,\varphi,\sigma) = \alpha \cdot \exp\left(\sigma\left(\cos(t-\varphi)-1\right)\right)$$

where $\sigma \in \mathbb{R}^+$ controls the "width" of the bump, $\alpha \in \mathbb{R}^+$ is the amplitude,
and $\varphi \in (-\pi,\pi)$ is the phase shift.
@@ -83,12 +83,12 @@ The definite integral here, assuming additionally that $T = 2\pi k, k \in \mathb
does not depend on $\varphi$ and
can be expressed in terms of the modified Bessel function of the first kind ([proof](https://www.smbc-comics.com/comic/2013-01-20)):

-$$\int_0^T \lambda(s|\alpha,\varphi,\sigma) \mathrm{d}s = T \cdot \alpha \cdot e^{-\sigma} \cdot I_0(\sigma)$$
+$$\int_0^T \lambda(s\,|\,\alpha,\varphi,\sigma) \mathrm{d}s = T \cdot \alpha \cdot e^{-\sigma} \cdot I_0(\sigma)$$

Although this is not an elementary function, it is well established and implemented in various libraries where you can also do the optimization.
And just to be clear, the sum in the log-likelihood in this case simplifies (trivially) to:

-$$\sum_{i=1}^n \log \lambda(t_i|\alpha,\varphi,\sigma) = n(\log(\alpha)-\sigma) + \sigma \sum_{i=1}^n \cos(t_i-\varphi)$$
+$$\sum_{i=1}^n \log \lambda(t_i\,|\,\alpha,\varphi,\sigma) = n(\log(\alpha)-\sigma) + \sigma \sum_{i=1}^n \cos(t_i-\varphi)$$

Now we can find the parameter values using our favorite numerical optimizer (see Appendix).
The results are shown in the following plots.
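
For illustration, here is a minimal sketch of such a fit in Python (not the Appendix code; it assumes NumPy/SciPy, event times already rescaled to radians so that one period is $2\pi$, and a placeholder data array):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import i0  # modified Bessel function of the first kind, order 0

def neg_log_likelihood(params, t, T):
    """Negative log-likelihood of lambda(t) = alpha * exp(sigma * (cos(t - phi) - 1)),
    using the closed-form integral T * alpha * exp(-sigma) * I_0(sigma)."""
    log_alpha, phi, log_sigma = params            # log-parametrize alpha, sigma to keep them positive
    alpha, sigma = np.exp(log_alpha), np.exp(log_sigma)
    event_term = t.size * (np.log(alpha) - sigma) + sigma * np.sum(np.cos(t - phi))
    integral_term = T * alpha * np.exp(-sigma) * i0(sigma)
    return -(event_term - integral_term)

# Placeholder event times in radians; substitute the real event records here.
t = np.array([0.3, 0.9, 1.1, 1.4, 2.0, 5.9, 6.1])
T = 2 * np.pi                                     # one full period of observation (k = 1)

res = minimize(neg_log_likelihood, x0=np.zeros(3), args=(t, T), method="Nelder-Mead")
log_alpha_hat, phi_hat, log_sigma_hat = res.x
alpha_hat, sigma_hat = np.exp(log_alpha_hat), np.exp(log_sigma_hat)
print(f"alpha = {alpha_hat:.3f}, phi = {phi_hat:.3f}, sigma = {sigma_hat:.3f}")
```

Log-parametrizing $\alpha$ and $\sigma$ keeps them positive without needing a constrained optimizer.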
23 changes: 10 additions & 13 deletions _posts/2024-12-07-glms.md
@@ -2,7 +2,7 @@
title: 'From linear regression to GLMs – theory and examples'
layout: post
date: 2024-12-07
-tags: [statistics,time series,regression,case study]
+tags: [statistics,time series,regression,case study,probabilistic modelling]
pin: true
math: true
excerpt_separator: '<!--more-->'
@@ -25,13 +25,9 @@ For small counts, the distinction between normal and a discrete counting distrib
Generalized linear models (GLMs) may fix this and similar issues by allowing other kinds of data distributions
while keeping a lot of the desirable and familiar properties of linear regression.

-## Linear regression
-<span title="who I expect to read this blog">Everybody<span>
-knows good-old linear regression.
-But it's a good point where to start with the topic, so let's recap a bit.
-First, let's study the relation of linear regression and ordinary least squares (OLS).

-### Gauss–Markov theorem
+## Gauss–Markov theorem
OLS is the most basic and most common method for making a linear fit,
and it is usually justified by referring to the [Gauss–Markov theorem](https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem),
which says that ordinary least squares is the best linear unbiased estimator under certain conditions.
@@ -62,7 +58,7 @@ This is quite powerful, because it doesn't specify which distribution $\varepsil

But what if we decided we want to maximize efficiency, perhaps even at the cost of a biased estimator?

-### MLE for linear regression
+## MLE for linear regression

To do MLE, we need a distribution.
So let's now specify what distribution the error terms $\varepsilon$ have.
@@ -74,15 +70,16 @@ where $\sigma \in \mathbb{R}^+$ is constant.
The assumptions of the Gauss–Markov theorem are satisfied.
Coincidentally, it <span title="Cool exercise for univariate case">can be shown</span>
that the maximum likelihood estimate of $\beta$ is in this case again OLS.
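To sketch why (keeping $\sigma$ fixed): up to an additive constant, the log-likelihood of the normal model is

$$\mathscr{l}(X,y\,|\,\beta,\sigma) = -\frac{1}{2\sigma^2}\lVert X\beta - y \rVert_2^2 + \mathrm{const},$$

so maximizing over $\beta$ is exactly minimizing $\lVert X\beta - y \rVert_2^2$, which is OLS.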
This may seem unsurprising, but let's have a look at another example.

-But what if the errors have a different distribution? For example [Laplace](https://en.wikipedia.org/wiki/Laplace_distribution).
+What if the errors have a different distribution? For example [Laplace](https://en.wikipedia.org/wiki/Laplace_distribution) (a.k.a. double exponential).

$$ \varepsilon_i \sim \mathrm{Laplace}\left(0, \; \theta'\right)$$

Now the assumptions of the Gauss–Markov theorem are still satisfied, but the MLE is least absolute deviations (LAD);
in other words, we use the $l^1$ norm instead of $l^2$:

-$$\hat{\beta} = \arg\min_{\beta}\mathscr{l}(X,y|\beta) = \arg\min_{\beta} \lVert X\beta - y \rVert_1 = \arg\min_{\beta} \sum_i | X_i \beta - y_i|$$
+$$\hat{\beta} = \arg\max_{\beta}\mathscr{l}(X,y\,|\,\beta) = \arg\min_{\beta} \lVert X\beta - y \rVert_1 = \arg\min_{\beta} \sum_i | X_i \beta - y_i|$$

Under usual circumstances, [MLE estimates are statistically efficient](https://gregorygundersen.com/blog/2019/11/28/asymptotic-normality-mle/), at least asymptotically.
The following plot shows some synthetic data with Laplace-distributed noise around the red line.
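
A rough sketch of such a comparison on synthetic data (assuming NumPy/SciPy; illustrative values only, not necessarily the code behind the plot):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic data: a line with intercept 1 and slope 2, plus Laplace noise
n = 200
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])                 # design matrix with an intercept column
y = X @ np.array([1.0, 2.0]) + rng.laplace(scale=1.5, size=n)

# OLS fit (closed form via least squares)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# LAD fit = MLE under Laplace errors: minimize the l1 norm of the residuals
beta_lad = minimize(lambda b: np.sum(np.abs(X @ b - y)), x0=beta_ols, method="Nelder-Mead").x

print("OLS:", beta_ols)
print("LAD:", beta_lad)
```

With heavier tails or a few outliers in the noise, the LAD fit is typically disturbed less than OLS.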
@@ -113,7 +110,7 @@ $$\begin{equation}\label{eq:normalerr2} y \sim \mathrm{Normal}\left(X\beta, \; \

and thus also trivially:

-$$ \mathrm{E}\left[y|X\right] = X\beta.$$
+$$ \mathrm{E}\left[y\,|\,X\right] = X\beta.$$

Now we want to generalize this to another distribution $\mathcal{D}\left(\theta, \theta'\right), \theta \in \Theta$.
Basically we want to plug $X\beta$ into the distribution's parameter that represents the mean value.
@@ -124,7 +121,7 @@ $$\begin{equation}\label{eq:link1}y \sim \mathcal{D}\left(g^{-1}\left(X\beta\rig

and

-$$\begin{equation}\label{eq:link2} \mathrm{E}\left[y|X\right] = g^{-1}(X\beta). \end{equation}$$
+$$\begin{equation}\label{eq:link2} \mathrm{E}\left[y\,|\,X\right] = g^{-1}(X\beta). \end{equation}$$
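
As a concrete instance of this notation (one standard choice, not the only one): take $\mathcal{D} = \mathrm{Poisson}$ with the log link $g = \log$, so that

$$y \sim \mathrm{Poisson}\left(e^{X\beta}\right), \qquad \mathrm{E}\left[y\,|\,X\right] = e^{X\beta},$$

and the linear predictor $X\beta$ can range over all of $\mathbb{R}$ while the mean stays positive.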

(Don't ask me why the inverse link is used in most places; it's just a weird convention.)
The choice of the link function is kind of arbitrary as long as its domain is the parameter space,
@@ -212,7 +209,7 @@ The prediction interval asks "What is the range of a future observed value $\hat
Here $X$ is the past observed data matrix and $x \in \mathbb{R}^k$ is a new vector of features for which we're making a prediction.
To answer this we will identify the distribution of $\hat{y} \in \mathbb{R}$. In this case, we have:

-$$y|x,\hat{\beta},\hat{\sigma} \sim \mathrm{Normal}(x^T\hat{\beta},\; \hat{\sigma}^2)$$
+$$y\,|\,x,\hat{\beta},\hat{\sigma} \sim \mathrm{Normal}(x^T\hat{\beta},\; \hat{\sigma}^2)$$

where we additionally need the MLE of $\sigma^2$, namely $\hat{\sigma}^2$,
which is given by the usual population variance formula:
@@ -255,7 +252,7 @@ $$\lim_{n\to\infty} \hat{y} \sim \mathcal{D}\left(g^{-1}\left(x^T\beta\right), \
So for a confidence level $\gamma \in (0, 1)$, the approximate prediction interval is the smallest set $C$
such that

-$$\int_C p_{\mathcal{D}}\left(y\;|\;g^{-1}\left(x^T\beta\right),\; \theta'\right) \, \mathrm{d}y \le \gamma$$
+$$\int_C p_{\mathcal{D}}\left(y\,|\,g^{-1}\left(x^T\beta\right),\; \theta'\right) \, \mathrm{d}y \ge \gamma$$

and analogously a sum for a discrete distribution.
Obviously, in practice, we must substitute the estimated values in place of $\beta, \theta'$.
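
To make the recipe tangible, here is a rough end-to-end sketch for a Poisson GLM with a log link (synthetic data and SciPy, not the post's own code; it uses an equal-tailed interval, which may differ slightly from the smallest set $C$):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

def neg_log_likelihood(beta, X, y):
    """Poisson GLM with log link: y_i ~ Poisson(exp(x_i^T beta)); constant terms dropped."""
    eta = X @ beta                                   # linear predictor
    return np.sum(np.exp(eta) - y * eta)

# Synthetic data (illustrative): one feature plus an intercept, true beta = (0.5, 1.0)
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.uniform(0, 2, 200)])
y = rng.poisson(np.exp(X @ np.array([0.5, 1.0])))

beta_hat = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1]), args=(X, y)).x

# Approximate prediction interval at a new feature vector x, confidence level gamma
x_new = np.array([1.0, 1.5])
mu_hat = np.exp(x_new @ beta_hat)                    # estimated mean g^{-1}(x^T beta_hat)
gamma = 0.9
lo, hi = poisson.interval(gamma, mu_hat)             # equal-tailed interval of the fitted Poisson
print(f"predicted mean {mu_hat:.2f}, ~{gamma:.0%} prediction interval [{lo:.0f}, {hi:.0f}]")
```

Note that the uncertainty of $\hat{\beta}$ itself is ignored here, which is exactly the asymptotic approximation discussed above.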
(The diff of the fifth changed file did not load.)
