Adding documentation for potential normalization issue (#1937)
* documents consideration of KDE for estimating CDF of data

* replaces a mistakenly deleted character

* whitespace

Co-authored-by: Jacob Bryan <[email protected]>
j-bryan and Jacob Bryan authored Aug 9, 2022
1 parent 7e525d4 commit 5652ca0
Showing 3 changed files with 47 additions and 3 deletions.
11 changes: 11 additions & 0 deletions doc/theory_manual/raven_theory_manual.bib
@@ -828,3 +828,14 @@ @article{Bailey2018
Volume = {Project Series 02},
Year = {2018}
}

@article{morales_methodology_2010,
title={A methodology to generate statistically dependent wind speed scenarios},
author={Morales, Juan M and Minguez, Roberto and Conejo, Antonio J},
journal={Applied Energy},
volume={87},
number={3},
pages={843--855},
year={2010},
publisher={Elsevier}
}
36 changes: 33 additions & 3 deletions doc/theory_manual/statisticalAnalysis.tex
@@ -298,9 +298,39 @@ \subsection{Normalized Sensitivity Matrix}
\item $\mathbb{E}(\boldsymbol{X})$ is the expected value of the output space
\end{itemize}

\subsection{Gaussianizing Noise of an Arbitrary Distribution}
Many statistical models require the noise in a signal to be Gaussian.
However, this is difficult to guarantee in real-world applications.
In RAVEN, the ARMA models in both the SupervisedLearning and TSA modules apply the distribution transformation of Morales et al.~\cite{morales_methodology_2010} to Gaussianize noise of an arbitrary, unknown distribution.

This transformation composes the cumulative distribution function (CDF) of the data with the inverse CDF of the unit normal distribution and is defined by
\begin{equation} \label{eq:gaussianizing_transform}
y_t = \Phi^{-1}\left[F_X\left(x_t\right)\right],
\end{equation}
where $\Phi$ is the CDF of the unit normal distribution, $F_X$ is the CDF of the input process $X_t$, the $x_t$ are the realized values of the random process, and the $y_t$ are the Gaussianized realization.
The transformation works by finding the quantile of $x_t$ under the original distribution via $F_X$, then mapping that quantile to the corresponding value of the unit normal distribution via the inverse CDF $\Phi^{-1}$.
New realizations in the Gaussian process, $\tilde{y}_t$, can be converted to their equivalent values in the original distribution by applying the inverse transformation
\begin{equation} \label{eq:gauss_inverse_transform}
\tilde{x}_t = F_X^{-1}\left[\Phi\left(\tilde{y}_t\right)\right].
\end{equation}
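
A minimal Python sketch of this round trip is given below, assuming $F_X$ is known exactly; the gamma distribution and its parameters are illustrative stand-ins, and the code is not the RAVEN implementation.

# Gaussianize data with a known CDF, then invert the transformation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
F_X = stats.gamma(a=2.0, scale=1.5)          # assumed "true" distribution of the data
x_t = F_X.rvs(size=1000, random_state=rng)   # realized random process

y_t = stats.norm.ppf(F_X.cdf(x_t))           # y_t = Phi^{-1}[ F_X(x_t) ]
x_rt = F_X.ppf(stats.norm.cdf(y_t))          # x_t = F_X^{-1}[ Phi(y_t) ]
assert np.allclose(x_rt, x_t)                # the round trip recovers the original data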

This transformation assumes that $F_X$ is known, which is very often not the case with real-world data.
Therefore, we must approximate $F_X$ before applying Eq.~\ref{eq:gaussianizing_transform}.
In RAVEN, this is done using the empirical CDF (ECDF), which assigns a point probability of $1/N$ to each of the $N$ data points, resulting in a step-wise approximation of the true CDF.
As a result, the ECDF satisfies $\hat{F}_X(x) = 0$ for $x < \min(x_t)$ and $\hat{F}_X(x) = 1$ for $x > \max(x_t)$.
This is problematic because $\Phi$ approaches 0 and 1 only asymptotically, so $\Phi^{-1}$ diverges at 0 and 1 and Eq.~\ref{eq:gaussianizing_transform} is not well-defined for values $x \leq \min(x_t)$ and $x \geq \max(x_t)$.
This issue is handled by adjusting the extreme values of $\hat{F}_X$ by some $\varepsilon$, such that
\begin{align*}
\hat{F}_X(\min(x_t)) &= \varepsilon, \\
\hat{F}_X(\max(x_t)) &= 1 - \varepsilon,
\end{align*}
where $\varepsilon$ is small.
However, the exact choice of $\varepsilon$ is rather arbitrary and can drastically change the quantiles assigned to the minimum and maximum values of $x_t$, because $\Phi^{-1}$ diverges as its argument approaches $0^+$ or $1^-$.
For example, taking $\varepsilon$ to be machine epsilon for double-precision floating point values in numpy ($\varepsilon \approx 2.22 \times 10^{-16}$) places the extreme values of $y_t$ more than 8 standard deviations from the mean, which is extremely unlikely to reflect the true process.
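
The sensitivity to $\varepsilon$ can be seen in the short sketch below, which Gaussianizes synthetic data through a stepwise ECDF; the ECDF convention and the two $\varepsilon$ values are illustrative assumptions rather than RAVEN's exact choices.

import numpy as np
from scipy import stats

x_t = np.random.default_rng(0).gamma(shape=2.0, scale=1.5, size=1000)

def ecdf_gaussianize(x, eps):
  """Gaussianize x via a stepwise ECDF, clipping its extremes to [eps, 1 - eps]."""
  F_hat = stats.rankdata(x, method='average') / len(x)   # point mass 1/N per datum
  F_hat = np.clip(F_hat, eps, 1.0 - eps)                  # avoid Phi^{-1}(0) and Phi^{-1}(1)
  return stats.norm.ppf(F_hat)

# Machine epsilon pushes the largest datum past 8 standard deviations from the
# mean, while a more moderate choice such as 1/(2N) keeps it near 3.3.
for eps in (np.finfo(float).eps, 1.0 / (2 * len(x_t))):
  y_t = ecdf_gaussianize(x_t, eps)
  print(f"eps = {eps:.2e}: max(y_t) = {y_t.max():.2f}")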

Kernel density estimation (KDE) has been considered as an alternative to the ECDF approximation because KDE preserves the asymptotic behavior of the CDF.
While KDE does provide a better estimate of the probability of the extreme values of $x_t$, the two estimated CDFs are approximately equal elsewhere, providing little practical benefit despite the drastically increased computational expense of KDE.
Poorly estimated extrema might appear to impose effective upper and lower bounds on the sampled data, since sampling values more than 6 to 8 standard deviations from the mean of a unit normal distribution is extremely unlikely.
However, this is not an issue in practice because the unit normal distribution is not being sampled directly.
Instead, an ARMA model is being fit to $y_t$, and the noise term of the ARMA model is not restricted to having a variance of 1.
As a result, realizations of the model can produce arbitrarily large values and thus can yield values similar to the original $x_t$ when Eq.~\ref{eq:gauss_inverse_transform} is applied, despite the poor approximation of the extreme values of $F_X$ by $\hat{F}_X$.
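
A rough comparison of the two CDF estimates at the sample extremes can be made with the sketch below; the default gaussian_kde bandwidth and the synthetic gamma data are assumptions for illustration, not the settings used in the study described above.

import numpy as np
from scipy import stats

x_t = np.random.default_rng(0).gamma(shape=2.0, scale=1.5, size=1000)
ecdf = stats.rankdata(x_t, method='average') / len(x_t)
kde = stats.gaussian_kde(x_t)   # default (Scott's rule) bandwidth

# The KDE-based CDF stays strictly inside (0, 1) at the sample extremes, so
# Phi^{-1} remains finite without an epsilon adjustment; the ECDF reaches
# exactly 1 at the maximum.
for label, x, F_e in (("min", x_t.min(), ecdf.min()), ("max", x_t.max(), ecdf.max())):
  F_kde = kde.integrate_box_1d(-np.inf, x)   # integrate the estimated density up to x
  print(f"{label}: ECDF = {F_e:.4f}, KDE CDF = {F_kde:.6f}")
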
3 changes: 3 additions & 0 deletions ravenframework/SupervisedLearning/ARMA.py
@@ -617,6 +617,9 @@ def __trainLocal__(self, featureVals, targetVals):
# Transform data to obtain normally distributed series. See
# J.M.Morales, R.Minguez, A.J.Conejo "A methodology to generate statistically dependent wind speed scenarios,"
# Applied Energy, 87(2010) 843-855
#
# Kernel density estimation has also been tried for estimating the CDF of the data but with little practical
# benefit over using the empirical CDF. See RAVEN Theory Manual for more discussion.
for t,target in enumerate(self.target):
# if target correlated with the zero-filter target, truncate the training material now?
timeSeriesData = targetVals[:,t]
