Adding documentation for potential normalization issue (#1937)
* documents consideration of KDE for estimating CDF of data

* replaces a mistakenly deleted character

* whitespace

Co-authored-by: Jacob Bryan <[email protected]>
j-bryan and Jacob Bryan authored Aug 9, 2022
1 parent 7e525d4 commit 5652ca0
Showing 3 changed files with 47 additions and 3 deletions.
11 changes: 11 additions & 0 deletions doc/theory_manual/raven_theory_manual.bib
@@ -828,3 +828,14 @@ @article{Bailey2018
Volume = {Project Series 02},
Year = {2018}
}

@article{morales_methodology_2010,
title={A methodology to generate statistically dependent wind speed scenarios},
author={Morales, Juan M and Minguez, Roberto and Conejo, Antonio J},
journal={Applied Energy},
volume={87},
number={3},
pages={843--855},
year={2010},
publisher={Elsevier}
}
36 changes: 33 additions & 3 deletions doc/theory_manual/statisticalAnalysis.tex
@@ -298,9 +298,39 @@ \subsection{Normalized Sensitivity Matrix}
\item $\mathbb{E}(\boldsymbol{X})$ is the expected value of the output space
\end{itemize}

\subsection{Gaussianizing Noise of an Arbitrary Distribution}
Many statistical models require the noise in a signal to be Gaussian.
However, this is difficult to guarantee in real-world applications.
In RAVEN, the ARMA models in both the SupervisedLearning and TSA modules apply the distribution transformation of Morales et al.~\cite{morales_methodology_2010} to Gaussianize noise of an arbitrary, unknown distribution.

This transformation composes the cumulative distribution function (CDF) of the data with the inverse CDF of the unit normal distribution and is defined by
\begin{equation} \label{eq:gaussianizing_transform}
y_t = \Phi^{-1}\left[F_X\left(x_t\right)\right],
\end{equation}
where $\Phi$ is the CDF of the unit normal distribution, $F_X$ is the CDF of the input process $X_t$, the $x_t$ are the realized values of the random process, and the $y_t$ are the Gaussianized realization.
The transformation works by finding the quantile of $x_t$ under the original distribution via $F_X$, then mapping that quantile to the corresponding value of the unit normal distribution via the inverse CDF $\Phi^{-1}$.
New realizations in the Gaussian process, $\tilde{y}_t$, can be converted to their equivalent values in the original distribution by applying the inverse transformation
\begin{equation} \label{eq:gauss_inverse_transform}
\tilde{x}_t = F_X^{-1}\left[\Phi\left(\tilde{y}_t\right)\right].
\end{equation}
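
A minimal Python sketch of this round trip is given below, assuming $F_X$ is known exactly; the gamma distribution and its parameters are illustrative stand-ins, and the code is not the RAVEN implementation.

# Gaussianize data with a known CDF, then invert the transformation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
F_X = stats.gamma(a=2.0, scale=1.5)          # assumed "true" distribution of the data
x_t = F_X.rvs(size=1000, random_state=rng)   # realized random process

y_t = stats.norm.ppf(F_X.cdf(x_t))           # y_t = Phi^{-1}[ F_X(x_t) ]
x_rt = F_X.ppf(stats.norm.cdf(y_t))          # x_t = F_X^{-1}[ Phi(y_t) ]
assert np.allclose(x_rt, x_t)                # the round trip recovers the original data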

This transformation assumes that $F_X$ is known, which is very often not the case with real-world data.
Therefore, we must approximate $F_X$ before applying Eq.~\ref{eq:gaussianizing_transform}.
In RAVEN, this is done using the empirical CDF (ECDF), which assigns a point probability of $1/N$ to each of the $N$ data points, resulting in a step-wise approximation of the true CDF.
As a result, the ECDF satisfies $\hat{F}_X(x) = 0$ for $x < \min(x_t)$ and $\hat{F}_X(x) = 1$ for $x > \max(x_t)$.
This is problematic because $\Phi$ approaches 0 and 1 only asymptotically, so $\Phi^{-1}$ diverges at 0 and 1 and Eq.~\ref{eq:gaussianizing_transform} is not well-defined for values $x \leq \min(x_t)$ and $x \geq \max(x_t)$.
This issue is handled by adjusting the extreme values of $\hat{F}_X$ by some $\varepsilon$, such that
\begin{align*}
\hat{F}_X(\min(x_t)) &= \varepsilon, \\
\hat{F}_X(\max(x_t)) &= 1 - \varepsilon,
\end{align*}
where $\varepsilon$ is small.
However, the exact choice of $\varepsilon$ is rather arbitrary and can drastically change the quantiles assigned to the minimum and maximum values of $x_t$, because $\Phi^{-1}$ diverges as its argument approaches $0^+$ or $1^-$.
For example, taking $\varepsilon$ to be machine epsilon for double-precision floating point values in numpy ($\varepsilon \approx 2.22 \times 10^{-16}$) places the extreme values of $y_t$ more than 8 standard deviations from the mean, which is extremely unlikely to reflect the true process.
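
The sensitivity to $\varepsilon$ can be seen in the short sketch below, which Gaussianizes synthetic data through a stepwise ECDF; the ECDF convention and the two $\varepsilon$ values are illustrative assumptions rather than RAVEN's exact choices.

import numpy as np
from scipy import stats

x_t = np.random.default_rng(0).gamma(shape=2.0, scale=1.5, size=1000)

def ecdf_gaussianize(x, eps):
  """Gaussianize x via a stepwise ECDF, clipping its extremes to [eps, 1 - eps]."""
  F_hat = stats.rankdata(x, method='average') / len(x)   # point mass 1/N per datum
  F_hat = np.clip(F_hat, eps, 1.0 - eps)                  # avoid Phi^{-1}(0) and Phi^{-1}(1)
  return stats.norm.ppf(F_hat)

# Machine epsilon pushes the largest datum past 8 standard deviations from the
# mean, while a more moderate choice such as 1/(2N) keeps it near 3.3.
for eps in (np.finfo(float).eps, 1.0 / (2 * len(x_t))):
  y_t = ecdf_gaussianize(x_t, eps)
  print(f"eps = {eps:.2e}: max(y_t) = {y_t.max():.2f}")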

Kernel density estimation (KDE) has been considered as an alternative to the ECDF approximation because KDE preserves the asymptotic behavior of the CDF.
While KDE does provide a better estimate of the probability of the extreme values of $x_t$, the two estimated CDFs are approximately equal elsewhere, providing little practical benefit despite the drastically increased computational expense of KDE.
Poorly estimated extrema might appear to impose effective upper and lower bounds on the sampled data, since sampling values more than 6 to 8 standard deviations from the mean of a unit normal distribution is extremely unlikely.
However, this is not an issue in practice because the unit normal distribution is not being sampled directly.
Instead, an ARMA model is being fit to $y_t$, and the noise term of the ARMA model is not restricted to having a variance of 1.
As a result, realizations of the model can produce arbitrarily large values and thus can yield values similar to the original $x_t$ when Eq.~\ref{eq:gauss_inverse_transform} is applied, despite the poor approximation of the extreme values of $F_X$ by $\hat{F}_X$.
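
A rough comparison of the two CDF estimates at the sample extremes can be made with the sketch below; the default gaussian_kde bandwidth and the synthetic gamma data are assumptions for illustration, not the settings used in the study described above.

import numpy as np
from scipy import stats

x_t = np.random.default_rng(0).gamma(shape=2.0, scale=1.5, size=1000)
ecdf = stats.rankdata(x_t, method='average') / len(x_t)
kde = stats.gaussian_kde(x_t)   # default (Scott's rule) bandwidth

# The KDE-based CDF stays strictly inside (0, 1) at the sample extremes, so
# Phi^{-1} remains finite without an epsilon adjustment; the ECDF reaches
# exactly 1 at the maximum.
for label, x, F_e in (("min", x_t.min(), ecdf.min()), ("max", x_t.max(), ecdf.max())):
  F_kde = kde.integrate_box_1d(-np.inf, x)   # integrate the estimated density up to x
  print(f"{label}: ECDF = {F_e:.4f}, KDE CDF = {F_kde:.6f}")
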
3 changes: 3 additions & 0 deletions ravenframework/SupervisedLearning/ARMA.py
@@ -617,6 +617,9 @@ def __trainLocal__(self, featureVals, targetVals):
# Transform data to obtain normally distributed series. See
# J.M.Morales, R.Minguez, A.J.Conejo "A methodology to generate statistically dependent wind speed scenarios,"
# Applied Energy, 87(2010) 843-855
#
# Kernel density estimation has also been tried for estimating the CDF of the data but with little practical
# benefit over using the empirical CDF. See RAVEN Theory Manual for more discussion.
for t,target in enumerate(self.target):
# if target correlated with the zero-filter target, truncate the training material now?
timeSeriesData = targetVals[:,t]
