# Processing of a hydrological dataset {#sec-processing}
#### The following concepts will be discussed: {.unnumbered}
::: callout-warning
## Concepts
- A hydrologic year,
- data screening and cleaning,
- summation and aggregation of data,
- descriptive statistics,
- visualization of statistical data.
:::
A dataset is a collection of measurements which is used to create or evaluate
assumptions about a population. Within this session, only **univariate** data are studied.
## Exploratory Data Analysis (EDA)
EDA is not a strictly formalized set of procedures; rather, it is a general approach
to studying, describing, extracting, evaluating and reporting on the data.
Hence the workflow is specific to the data being processed.\
When studying data, we will look at certain statistical features, which represent estimated characteristics of the whole data set.
```{r, collapse=TRUE}
set.seed(123) #<1>
x <- rnorm(n = 30, mean = 50, sd = 10) #<2>
x
```
1. Setting the seed ensures that we get the same generated numbers every time.
2. Some randomly generated numbers from the Normal distribution.
::: callout-tip
## Exercise
1. Try to play with the function `rnorm()` by changing the parameters. What is the
purpose of `set.seed()`?
:::
### Measures of location
The furthest one can reduce a data set, while still retaining any information at all, is to a single value. Measures of location describe the central tendency of the data.
#### Mean
The **arithmetic mean** (or average) is simply the sum of observations divided by the number of observations.
$$
\bar{x} = \dfrac{1}{n}\sum\limits_{i=1}^{n} x_i
$$
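As a quick check, the built-in `mean()` function and the formula written out by hand should give the same value:
```{r, collapse=TRUE}
mean(x)
sum(x) / length(x) # the formula written out
```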
#### Median
The **median** is another measure of central tendency, based on the *sorted* data set. It is the $50$th percentile of the data: half of the values are smaller than the median and half are larger.
```{r, collapse=TRUE}
median(x)
(sort(x)[length(x)/2] + sort(x)[length(x)/2 + 1])/2
```
#### Mode
For continuous real-valued variables, the **mode** can be computed either
by cutting the values into meaningful bins or by using a kernel density estimate.
```{r}
cut(x, breaks = 10)
y <- cut(x, breaks = seq(from = -10, to = 100, by = 10))
table(y)
barplot(height = table(y))
```
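As a small sketch of the kernel density approach mentioned above, the mode can be approximated by the location of the maximum of the estimated density (the default bandwidth of `density()` is used here):
```{r, collapse=TRUE}
d <- density(x)     # kernel density estimate
d$x[which.max(d$y)] # location of the highest density, an approximate mode
```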
::: callout-tip
## Exercise
1. a) Change the `cut()` call so that the vector $x$ is cut into 20 bins.
b) Add a single number to the vector $x$ so that the sample mean rises to `r fitb(c("64.131", "64,131"))`
:::
#### $n^{\mathrm{th}}$ quantile
A quantile is the value below which a given proportion of the data falls; the `quantile()` function returns it for any chosen probability.
```{r}
quantile(x, probs = c(0.1, 0.9))
```
### Measures of spread
With **measures of variability** we describe the spread of the data around its central tendency. Some degree of variability is a naturally occurring phenomenon.
#### Variation range
The simplest measure of spread is the **variation range**. It is computed as the difference between the two extremes of the data.
$$
R = \max{x} - \min{x}
$$
```{r}
max(x) - min(x)
```
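Base R also provides `range()`, which returns the minimum and the maximum at once, so the same value can be obtained more concisely:
```{r, collapse=TRUE}
diff(range(x)) # max(x) - min(x)
```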
#### Variance and sample variance
We calculate the **variance** as the average of the squared deviations of the
individual values from their arithmetic mean $\bar{x}$.
$$
\begin{array}{rl}
\sigma^2 =& \dfrac{1}{n}\sum\limits_{i=1}^{n}(x_i - \bar{x})^2\\
s^2 =& \dfrac{1}{n-1}\sum\limits_{i=1}^{n}(x_i - \bar{x})^2
\end{array}
$$
Usually we do not have the ability to work with data for the whole population.
The **sample variance** formula $s^2$ then needs to be used for an unbiased estimate.
#### Standard deviation
The **standard deviation** is defined as the square root of the variance.
```{r, collapse=TRUE}
sd(x)
var(x)^0.5
```
#### Coefficient of variation
The **coefficient of variation** is given by the ratio of the standard deviation to the arithmetic mean; it follows from the definition of this ratio that it is a dimensionless indicator.
$$
\nu = \frac{s}{\bar{x}}
$$
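There is no dedicated base R function for it, but the ratio is straightforward to compute from the pieces we already know:
```{r, collapse=TRUE}
sd(x) / mean(x) # coefficient of variation
```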
#### Interquartile range
The $\text{IQR}$ is the **upper quartile** ($75$th percentile) minus the **lower quartile** ($25$th percentile)
$$
\text{IQR} = Q_{\text{III}} - Q_{\text{I}}
$$
```{r}
IQR(x)
```
::: callout-tip
## Exercise
1. Using the knowledge from @sec-introduction, create your own function to
calculate the sample variance according to the formula.
:::
## Hydrological data
Datasets containing hydrological data quite commonly, although not exclusively, come in tabular (rectangular) shape.
Let's take a look at sample data from **MOPEX**, a fairly extensive curated dataset.
It covers some 438 catchments with measured discharge at a daily time step. The dataset used
to be publicly available at https://hydrology.nws.noaa.gov/pub/gcip/mopex/US_Data/Us_438_Daily,
but the site is no longer available. Several even more extensive datasets have been published since.
```{r, collapse=FALSE}
url <- "data/01138000.dly" #<1>
mpx_01138000 <- read.fwf(file = url, #<2>
widths = c(8, rep(10, times = 5)), #<2>
header = FALSE) #<2>
names(mpx_01138000) <- c("date", "prec", "pet", "r", "tmax", "tmin") #<3>
mpx_01138000$date <- gsub(x = mpx_01138000$date, #<4>
pattern = " ", #<4>
replacement = "0") #<4>
mpx_01138000$date <- as.Date(x = mpx_01138000$date, #<4>
format = "%Y%m%d") #<4>
mpx_01138000[which(mpx_01138000$r < 0), "r"] <- NA #<5>
rbind(head(mpx_01138000, n = 5), #<6>
tail(mpx_01138000, n = 5)) #<6>
```
1. This file is stored locally as part of this site. You will have to change this to your own path.
2. Load the data in **fixed width format** using the `read.fwf()` function.
3. Provide variable names, since there aren't any.
4. Now there is some trouble with the dates. If you evaluate the code sequentially, you will see that
the dates are stored as **"YYYYmmdd"** with blanks instead of leading zeros, so we have to fill the
blank spots with zeros before converting with `as.Date()`.
5. After that, some variables contain non-physical values like `-9999`. These would cause
trouble in statistical processing, so we replace them with `NA` (not available).
6. Let's display the first 5 and last 5 rows of the result.
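One quick way to screen the imported data (see the *Concepts* above) is `summary()`, which reports the range, the quartiles and the number of remaining `NA` values for each column:
```{r}
summary(mpx_01138000[, c("prec", "pet", "r", "tmax", "tmin")])
```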
::: callout-tip
## Exercise
1. Data coming from field measurements are usually accompanied by *metadata*.
This could be information about when the data were collected, the site location,
precision, missing values etc. Sometimes the metadata are in the header of the file.
Download and process [this file](data/6242910_Q_Day.Cmd.txt) in a similar manner
with the help of the `read.delim()` function.
:::
## Aggregation and summation
Let's add some features to the data. For the purpose of the analysis it would be
beneficial to add `year`, `month`, `quart` (quarters), `decade` and `hyear`, the **hydrological year** (a sketch for `decade` follows below).
```{r}
mpx_01138000$year <- as.integer(format(mpx_01138000$date, "%Y")) #<1>
mpx_01138000$month <- as.integer(format(mpx_01138000$date, "%m")) #<1>
```
1. With the help of the format specifiers we extract the year and the month, respectively.
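The `decade` variable is not created in the chunks that follow; one simple way to derive it from `year` is integer division, labelling each decade by its first year:
```{r}
mpx_01138000$decade <- (mpx_01138000$year %/% 10) * 10 # e.g. 1957 becomes 1950
```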
::: callout-tip
## Exercise
1. a) Add `quart` variable to `mpx_01138000` using `quarters()` function.
b) Provide the fraction of days with no precipitation: `r fitb(c("0.221277", "0,221277", "4526/20454"))`
c) How many freezing days (Tmin < 0) occurred? `r fitb(10244)`
d) How many freezing days occurred in late spring (May) and summer? `r fitb(308)`
e) Find the 3 highest `r fitb("34.3056, 33.6333, 33.4889")` and the 3 lowest temperatures `r fitb("-17.0000, -16.4889, -16.4611")` (delimit by a comma).
:::
```{r}
mpx_01138000$hyear <- ifelse(test = mpx_01138000$month > 10, #<1>
yes = as.integer(mpx_01138000$year + 1), #<1>
no = as.integer(mpx_01138000$year)) #<1>
```
1. The hydrological year in Czechia is defined from $1^{\mathrm{st}}$ November to $31^{\mathrm{st}}$ October of the following year.
Many of the following operations are based on **grouping**. In hydrology, the natural groups include *stations/basins*, *decades/years/months*, *groundwater regions*, etc., as the short example below shows.
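For instance, the mean daily runoff per hydrological year can be obtained with `aggregate()`; the variable name `r_hyear` is arbitrary:
```{r}
r_hyear <- aggregate(r ~ hyear, FUN = mean, data = mpx_01138000)
head(r_hyear)
```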
### Box-plot
Box plots come in handy when we want to visualize the important quantiles of a variable across the levels of a categorical variable.
```{r, fig.align='center'}
station <- read.delim(file = "./data/6242910_Q_Day.Cmd.txt",
skip = 36,
header = TRUE,
sep = ";",
col.names = c("date", "time", "value"))
station$date <- as.Date(station$date,
format = "%Y-%m-%d")
station_agg <- aggregate(station$value ~ as.factor(data.table::month(station$date)),
FUN = "quantile",
probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
names(station_agg) <- c("Month", "Discharge")
par(font.main = 1,
adj = 0.1,
xaxs = "i",
yaxs = "i")
boxplot(data = station_agg,
Discharge ~ Month,
main = "Distribution of discharge in months",
frame = FALSE,
ylim = c(0,20),
xlab = "",
ylab = bquote(Discharge), font.main = 1)
```
### Q-Q plot
The so-called **rankit graph** produces a quantile-quantile plot of the values
from the selected data. It lets us compare whether the distribution of the data fits
an assumed distribution, e.g. the Normal. `qqline()` then adds the theoretical line.
```{r, fig.align='center'}
par(font.main = 1,
adj = 0.1,
xaxs = "i",
yaxs = "i")
qqnorm(mpx_01138000$tmax,
pch = 21,
col = "black",
bg = "lightgray", cex = 0.5,
main = "")
qqline(mpx_01138000$tmax)
```
<!-- ```{r} -->
<!-- plot(quantile(rnorm(1000), probs = 1:100/100), -->
<!-- quantile(rnorm(1000), probs = 1:100/100), frame = FALSE, pch = 20) -->
<!-- abline(0, 1, col = "red") -->
<!-- ``` -->
<!-- ### P-P plot -->
<!-- ```{r} -->
<!-- n <- 100 -->
<!-- data <- rweibull(n, shape = 2, scale = 1) -->
<!-- # Estimated parameters for Pareto -->
<!-- shape_est <- 2 -->
<!-- scale_est <- 1 -->
<!-- # Calculate the theoretical quantiles for the Pareto distribution -->
<!-- theoretical_quantiles <- qweibull(p = seq(0, 1, 0.01), shape = shape_est, scale = scale_est) -->
<!-- # Create the Q-Q plot -->
<!-- qqplot(theoretical_quantiles, data, -->
<!-- xlab = "Theoretical Quantiles", -->
<!-- ylab = "Sample Quantiles", ylim = c(0,2.5), xlim = c(0, 2.5)) -->
<!-- abline(0, 1, col = "red") # Add a 45-degree reference line -->
<!-- ``` -->
### Empirical distribution function
A large portion of the data processed in hydrology originates from time series
of streamflow or runoff. This enables us to construct the empirical probability of exceeding a
certain value in the data, $P(X\geq x_k)$, simply using the well-known definition
$$
P = \dfrac{m}{n}
$$
where $m$ is the number of values reaching or exceeding $x_k$ and $n$
is the length of the series. This equation is strictly valid only for $n \rightarrow \infty$.\
There are several empirical formulas in use for calculating the empirical exceedance
probability, such as the one by Čegodajev:
$$
P = \dfrac{m - 0.3}{n + 0.4}
$$
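A minimal sketch of this formula as an R function follows; the helper name `ceg_prob()` is made up for illustration and the values are ranked from the largest down:
```{r, collapse=TRUE}
ceg_prob <- function(x) {
  m <- rank(-x)          # m = 1 for the largest value
  n <- length(x)
  (m - 0.3) / (n + 0.4)  # Čegodajev empirical exceedance probability
}
head(ceg_prob(na.omit(mpx_01138000$r)))
```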
In R we can utilize a higher-order function called `ecdf()`.
```{r, include=FALSE}
plot(cumsum(1/rank(mpx_01138000$r)/length(mpx_01138000$r)))
```
```{r}
ecdf_data <- ecdf(na.omit(mpx_01138000$r)) # <1>
ecdf_data(35) #<2>
```
1. Create the empirical cumulative distribution function (ECDF) for the data.
2. Calculate the proportion of the data lower than or equal to 35.
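The object returned by `ecdf()` can also be passed to `plot()` directly, which draws the whole empirical cumulative distribution curve:
```{r, fig.align='center'}
plot(ecdf_data,
     main = "Empirical CDF of runoff",
     xlab = "Runoff",
     ylab = "P(X <= x)")
```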
<!-- ```{r, eval=FALSE} -->
<!-- library(tidyverse) -->
<!-- # ddd <- read_fwf("data/6242910_Q_Day.Cmd.txt", -->
<!-- # skip = 36, -->
<!-- # col_positions = fwf_widths(widths = c(8, 4, 4, 8, 2)), col_types = "cccdc") -->
<!-- ddd <- read_csv2("data/6242910_Q_Day.Cmd.txt", -->
<!-- skip = 36, -->
<!-- col_types = "ccc") -->
<!-- ddd |> -->
<!-- set_names(c("Date", "D", "Discharge")) |> -->
<!-- mutate(Date = as.Date(Date, format = "%Y-%m-%d"), -->
<!-- Discharge = as.numeric(Discharge)) |> -->
<!-- select(Date, Discharge) -> ddd -->
<!-- ddd |> -->
<!-- mutate(ri = rank(Discharge), -->
<!-- s = sd(Discharge), -->
<!-- q = pweibull(Discharge, shape = 2, scale = 1)) -->
<!-- pdfunc <- function(x, b = 0.44) { -->
<!-- n <- length(x) -->
<!-- m <- rank(x) -->
<!-- (m - b)/(n + 1 - 2*b) -->
<!-- } -->
<!-- plot(pdfunc(x <- ddd$Discharge), type = "l") -->
<!-- ddd |> -->
<!-- ggplot(aes(Date, Discharge)) + -->
<!-- geom_line() -->
<!-- ddd |> -->
<!-- ggplot(aes(Discharge, ri = rank(Discharge / length(Discharge)))) + -->
<!-- geom_line() -->
<!-- ddd |> -->
<!-- mutate(pi = rank(Discharge - 0.5)/length(Discharge)) -->
<!-- x <- c(1, 6 ,6, 1, 2, 0) -->
<!-- rank(x) -->
<!-- ``` -->
### Exceedance curve
The exceedance curve provides information on how often or for how long a given streamflow value
was reached or exceeded within a certain period.
```{r, fig.align='center'}
r_sorted <- sort(mpx_01138000$r, decreasing = TRUE) # sort() drops NA values by default
# Empirical plotting positions
n <- length(r_sorted)
m <- seq_len(n)                       # rank of each sorted value, 1 = the largest
return_period <- (n + 1) / m          # empirical return period
exceedance_prob <- 1 / return_period  # Weibull plotting position m / (n + 1)
plot(x = exceedance_prob,
y = r_sorted,
type = "l",
xlab = "Exceedance Probability",
ylab = "Runoff Value",
main = "Empirical Exceedance Curve")
```
<!-- ### Return period -->
<!-- We calculate the return period as inverse to the exceedance probability -->
<!-- ```{r, fig.align='center'} -->
<!-- # return_period <- 1/exceedance_prob -->
<!-- plot(x = return_period, -->
<!-- y = r_sorted, -->
<!-- type = "l", -->
<!-- xlab = "Return Period", -->
<!-- ylab = "Runoff Value", -->
<!-- main = "Empirical Return Period") -->
<!-- ``` -->
<!-- ### Frequency distribution curve -->
<!-- ## Correlation and autocorrelation -->
<!-- $$ -->
<!-- \rho_{X, Y} = \dfrac{\mathrm{cov}(X,Y)}{\sigma_X\sigma_Y} -->
<!-- $$ $$ -->
<!-- \rho_{X,X} -->
<!-- $$ -->
## Hydroclimatic indices
Hydroclimatic indices originate from combining the aggregation and summation methods
with the measures of location and dispersion.
### Runoff coefficient
The concept of the runoff coefficient comes from the presumption that, over a long time period,
$$
\varphi [-] = \dfrac{R}{P}
$$
where $R$ represents the long-term runoff and $P$ the long-term precipitation, typically over a 30-year
so-called **climatic period**, a 30-year reference window shifted by decades;
the latest one being 1991--2020.
```{r}
R <- mean(aggregate(r ~ year, FUN = mean, data = mpx_01138000)[, "r"])
P <- mean(aggregate(prec ~ year, FUN = mean, data = mpx_01138000)[, "prec"])
R/P
```
::: callout-tip
## Exercise
1. Calculate the runoff coefficient $\varphi$ for the two latest decades separately.
:::
### Flow duration curve
This is an essential curve in the energy sector as well as in river ecology.
```{r, fig.align='center'}
M <- c(30, 60, 90, 120, 150, 180, 210, 240, 270, 300, 330, 355, 364)
m_day <- function(streamflow) {
quantile(na.omit(streamflow), 1 - M/365.25)
}
plot(M, m_day(mpx_01138000$r),
type = "b",
pch = 21,
bg = "darkgray",
xlab = "",
ylab = "Discharge",
main = "M-day discharge")
```