generated from cotes2020/chirpy-starter
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Jachym.Barvinek
committed
Oct 17, 2024
1 parent
7862aec
commit 8790ea5
Showing
4 changed files
with
245 additions
and
37 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file was deleted.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,106 @@ | ||
--- | ||
title: 'Building a probabilistic time series model from scratch: Chapter 0 – Waiting times' | ||
layout: post | ||
date: 2024-10-16 | ||
categories: [] | ||
tags: [] | ||
pin: true | ||
math: true | ||
--- | ||
|
||
Eventually, in this series, we will build a rather elaborate hierarchical predictive time-series model. The motivating application in this series is modeling e-commerce sales. However, I don't just want to present the final result. Rather, let's explore the entire thought process that leads us there, starting from the very basics. | ||
|
||
On the lowest level of data available for time series modeling, we have timestamps of events, possibly with some annotations (e.g., type of product, amount). In practice, we often work with pre-aggregated data, such as total daily sales. However, if we want to be explicit about all the assumptions we are making, we should consider the individual events as the elementary units. | ||
In this chapter, I discuss how the exponential distribution arises from a basic assumption. | ||
|
||
## The core questions | ||
|
||
The core questions we are asking is: _What is the distribution of total amount within a given time interval?_ | ||
|
||
Using the scenario with per-event amount, we can break this down into: | ||
1. _What is the probability distribution of the waiting time to the next event at each moment in time?_ | ||
2. _What is the probability distribution of the amount of the next event?_ | ||
|
||
Any time series model is basically an attempt to answer this, | ||
each of them using a different set of assumptions, simplifications and approximations. | ||
|
||
For now, we will only focus on answering question number 1 as it seems more challenging in practice. | ||
In e-commerce the rate of orders may reasonably be expected to change much more than the amount requested in each single order. | ||
|
||
## Waiting times – the simplest model | ||
|
||
We will model time as continuum (i.e. timestamps are represented by real numbers) | ||
and we make two assumptions that could be considered rather strong in the time series context: | ||
1. The waiting times are [i.i.d](https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables). | ||
In the time series case, this means the distribution parametrization is constant in time. | ||
|
||
This may seem overly naïve in many cases (e.g. because of popularity trends or seasonalities), | ||
but keep in mind that it will hold at least locally | ||
in most practical cases in the sense that while the distribution does in fact change in time, there will be some sort of continuity to it. | ||
2. The distribution is [memoryless](https://en.wikipedia.org/wiki/Memorylessness). | ||
|
||
This means, the time already spent waiting for an event does not affect how much longer we need to wait for the next one to occur. | ||
More formally, this is written as | ||
|
||
$$P(Y > t+s | Y > t)=P(Y > s)$$ | ||
|
||
for the random variable $Y$ describing the waiting time between events, where $t>0$ can be interpreted as time already waited and $s>0$ as time yet to wait. | ||
|
||
Let's consider when this is justifiable. A customer not caring about how long ago the previous customer purchased something seems like a reasonable assumption. On the other hand, if we sell a consumable product, and the same customer wants to buy it again soon but not immediately, the memoryless assumption may no longer be appropriate. | ||
|
||
|
||
It can be shown rigorously, that the continuous memomoryless distribution | ||
needs to have density | ||
|
||
$$p(t) \propto \exp(-\lambda t)$$ | ||
|
||
for some parameter $\lambda>0$. | ||
This is of course the [exponential distribution](https://en.wikipedia.org/wiki/Exponential_distribution). | ||
|
||
The idea of the proof is to rewrite (using conditional probability definition) the memorylessness assumption as | ||
|
||
$$P(Y > t+s) = P(Y > s) \cdot P(Y > t)$$ | ||
|
||
and notice that this describes a situation where the complementary cumulative distribution function maps addition to multiplication, | ||
which only an exponential function can do. | ||
|
||
## Learning | ||
Using both [maximum likelihood estimation](https://www.statlect.com/fundamentals-of-statistics/exponential-distribution-maximum-likelihood) | ||
and [method of moments](https://math.stackexchange.com/questions/2943668/method-of-moments-exponential-distribution), | ||
we can obtain the estimate for $\lambda$ to be the reciprocal of the sample (arithmetic) mean: | ||
|
||
$$\hat{\lambda}=\frac{1}{\bar{\mathbf{y}}}$$ | ||
|
||
for the waiting times $\mathbf{y}$. | ||
This aligns nicely the fact that $\mathrm{E}\[Y\] = \frac{1}{\lambda}$. | ||
|
||
|
||
## Assumptions remarks | ||
Note that only the memoryless assumption is needed to derive the expnential distribution. | ||
The i.i.d. assumption lives on a higher level, but I wanted to emphasise the fact that | ||
we will eventually be modelling time series and the i.i.d. assumption would imply the $\lambda$ parameter is constant in time. | ||
But of course, we need the i.i.d. to reasonably learn the parameter from data. | ||
|
||
If we consider the memorylessness assumption to be unrealistic, | ||
simple distributions often used to describe waiting times include [Weibull](https://en.wikipedia.org/wiki/Weibull_distribution) | ||
and [Gamma](https://en.wikipedia.org/wiki/Gamma_distribution) but in principle any distribution with $(0,\infty)$ support can be used when justified. | ||
These both generalize the exponential distribution in different directions. More on that in later chapters. | ||
|
||
However, the identical distribution part of the i.i.d. assumption seems way more problematic if we want to model anything non-trivial. | ||
We focus on this in the subsequent chapters. | ||
|
||
|
||
## STAN demo – fitting exponential distribution | ||
|
||
```stan | ||
data { | ||
int<lower=1> N; | ||
array[N] real<lower=0> y; // Waiting times | ||
} | ||
parameters { | ||
real<lower=0> lambda; | ||
} | ||
model { | ||
y ~ exponential(lambda); | ||
} | ||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,137 @@ | ||
--- | ||
title: 'Building a probabilistic time series model from scratch: Chapter 1 – Counting events' | ||
layout: post | ||
date: 2024-10-17 | ||
categories: [] | ||
tags: [] | ||
pin: true | ||
math: true | ||
--- | ||
|
||
In the previous chapter I claimed that at the lowest level of any time series | ||
probabilistic model are timestamps separated with randomly distributed waiting times. | ||
However, in practice the quantity we are actually more interested in e-commerce sales | ||
are counts of the events within a given interval. | ||
|
||
In this chapter we study the general case when the waiting time distribution does not need to satisfy the memoryless | ||
property and then show how Poisson and Gamma distribution arise assuming memorylessness of the basic events. | ||
|
||
## Basic formalization | ||
|
||
Let's describe what does this mean formally. | ||
We still keep the i.i.d. assumption for now, but consider a general | ||
probability distribution over $(0,\infty)$ with density $p(t)$ describing | ||
the probability of waiting time. The waiting times are a sequence of random variables $Y_1,Y_2,...$. | ||
Now, the time of the $k$-th event is: | ||
|
||
$$S_k = Y_1 + Y_2 + ... + Y_k$$ | ||
|
||
We are interested in the number of events $N(T)$ that occur in the interval $(0,T)$. That is: | ||
|
||
$$N(T) = \max \{ k | S_k < T \}$$ | ||
|
||
For further reference, this line of thinking is developed in the [renewal theory](https://en.wikipedia.org/wiki/Renewal_theory). | ||
|
||
## Probability of $k$ events | ||
|
||
Consider that the $k$-th even must occur before $T$ and $(k+1)$-the event must occur after $T$. | ||
We can write that down as an equation: | ||
|
||
$$P(N(T) = k) = P(S_k < T \le S_{k+1}) = P(S_k < T, T \le S_{k+1})$$ | ||
|
||
Using the independence assumption we can break this into: | ||
|
||
$$P(S_k < T, T \le S_{k+1}) = P(S_k < T) \cdot P(T \le S_{k+1})$$ | ||
|
||
The probability of the $k$-th event occuring before time $T$ is given by the cumulative distribution function (CDF) | ||
|
||
$$F_k(T) = P(S_k < T)$$ | ||
|
||
and the probability of the $(k+1)$-th event is given by the complementary CDF. | ||
Putting this together, we obtain | ||
|
||
$$P(N(T) = k) = F_k(T) \cdot (1 - F_{k+1}(T)).$$ | ||
|
||
## Cumulative density of $S_k$ | ||
We could naturally ask: what is a more explicit formula for $F_k(T)$? | ||
|
||
Obviously, for $F_1(T) = \int_0^T p(t) \mathrm{d}t$. Consider now that $S_2 = Y_1 + Y_2$ | ||
has an arbitrary realization where $S_2$ happens at time $t$, and $Y_1,Y_2$ happen at $y_1,y_2$ respectively. | ||
We have $y_2 = t - y_1$. | ||
Since the events $Y_1, Y_2$ are assumed i.i.d., the probability density will be: | ||
|
||
$$p_{S_2}(t) = p(y_1) \cdot p(t - y_1)$$ | ||
|
||
Summing over all possible times of $y_1$, we see that this is just convolution: | ||
|
||
$$p_{S_2}(t) = \int_{-\infty}^{\infty} p(y_1) \cdot p(t - y_1) \mathrm{d}y_1 = (p \ast p)(t)$$ | ||
|
||
This logic can be applied inductively to obtain that $p_{S_k}(t) = p^{\ast k}(t)$ and thus: | ||
|
||
$$F_k(T) = \int_0^T p_{S_k}(t) \mathrm{d}t = \int_0^T p^{\ast k}(t) \mathrm{d}t$$ | ||
|
||
## Special case for exponential distribution | ||
The above approach is rather general and can be in principle used for any i.i.d. waiting times. | ||
In the previous chapter, we have shown that the exponential distribution arises from | ||
a memorylessness assumption which may quite reasonable in practical modelling. | ||
Let's see what happens when we use this as the waiting time and plug it into the above results. | ||
Under this extra assumption, the renewal process is called (homogeneuous) [Poisson process](https://en.wikipedia.org/wiki/Poisson_point_process). | ||
|
||
As a reminder, the density of the exponential distribution has a single parameter $\lambda$ and is given by: | ||
|
||
$$p(t) = \lambda e^{-\lambda t}$$ for $t > 0$, otherwise $0$. | ||
|
||
Convolving this density with itself produces: | ||
|
||
$$p_{S_2}(t) = (p \ast p)(t) = \int_{0}^{t} \lambda e^{-\lambda y} \cdot \lambda e^{-\lambda (t-y)} \mathrm{d}y$$ | ||
|
||
The $y$s cancel out and we can simplify to | ||
|
||
$$p_{S_2}(t) = \lambda^2 e^{-\lambda t} \int_{0}^{t} \mathrm{d}y = \lambda^2 e^{-\lambda t} \cdot t$$ | ||
|
||
Again, this pattern can be inductively generalized to: | ||
|
||
$$p_{S_k}(t) \propto \lambda^k t^{k-1} e^{-\lambda t}$$ | ||
|
||
As a side note, we may observe an interesting fact. The above is actually saying: | ||
|
||
$$S_k \sim Gamma(k, \lambda)$$ | ||
|
||
From the formula for $p_{S_k}$ we can easily get: | ||
|
||
$$F_k(T) = \int_0^T p_{S_k}(t) \mathrm{d}t = T \cdot p_{S_k}(T)$$ | ||
|
||
$$F_k(T) \propto (\lambda T)^k e^{-\lambda T}$$ | ||
|
||
Which in this case is actually telling us that: | ||
|
||
$$N(T) \sim Poisson(\lambda T)$$ | ||
|
||
Et voilà! Going through this convoluted (pun intended) route, we eventually arrived | ||
at the good old familiar Poisson distribution probability mass function. | ||
|
||
### Takeaway | ||
We derived the Poisson distribution as the count of events of a random process from nothing but the following assumptions: | ||
* The events are i.i.d. | ||
* The waiting time between the events is a memoryless continuous random distribution. | ||
|
||
We can also see how the Poisson distribution is intimately related to the Gamma distribution: | ||
The Poisson gives counts within an interval and the Gamma gives the length of the interval to achieve a given count, | ||
both using a common rate parameter. | ||
This close relation is further emphasised in Bayesian statistics where the Gamma distribution can serve at the conjugate prior for the rate | ||
of the exponential, Poisson or another Gamma distribution. | ||
|
||
If the memoryles assumption doesn't seem realistic, it can be worked around using the general method at the cost of higher complexity. | ||
|
||
### Learning | ||
Using both maximum likelihood estimation | ||
and method of moments, | ||
we can obtain the estimate for $\lambda$ to be the sample (arithmetic) mean: | ||
|
||
$$\hat{\lambda}=\bar{\mathbf{n}}$$ | ||
|
||
for the counts $\mathbf{n}$. | ||
|
||
Compare this to the reciprocal value for the exponential distribution. | ||
The duality between the interpretation of the rate parameter for the Poisson and exponential | ||
distributions can be seen as analogical to the duality between frequency and period. |