-
Notifications
You must be signed in to change notification settings - Fork 14
/
07-Twostage.Rmd
492 lines (386 loc) · 29.3 KB
/
07-Twostage.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
# Two-stage cluster random sampling {#Twostage}
As opposed to cluster random sampling in which all population units of a cluster are observed (Chapter \@ref(Cl)), in two-stage cluster random sampling\index{Two-stage cluster random sampling} not all units of the selected clusters are observed, but only some. In two-stage cluster random sampling the clusters will generally be contiguous groups of units, for instance all points in a map polygon (the polygons on the map are the clusters), whereas in one-stage cluster random sampling the clusters generally are non-contiguous. The units to be observed are selected by random subsampling of the randomly selected clusters. In two-stage cluster sampling, the clusters are commonly referred to as primary sampling units\index{Primary sampling unit} (PSUs) and the units selected in the second stage as the secondary sampling units\index{Secondary sampling unit} (SSUs).
As with cluster random sampling, two-stage cluster random sampling may lead to a strong spatial clustering of the selected population units in the study area. This may save considerable time for fieldwork, and more population units can be observed for the same budget. However, due to the spatial clustering the estimates will generally be less precise compared to samples of the same size selected by a design that leads to a much better spreading of the sampling units throughout the study area, such as systematic random sampling.
In two-stage cluster random sampling, in principle any type of sampling design can be used at the two stages, leading to numerous combinations. An example is (SI,SI), in which both PSUs and SSUs are selected by simple random sampling.
Commonly, the PSUs have unequal size, i.e., the number of SSUs (finite population) or the area (infinite population) are not equal for all PSUs. Think for instance of the agricultural fields, forest stands, lakes, river sections, etc. in an area. If the PSUs are of unequal size, then PSUs can best be selected with probabilities proportional to their size (pps). Recall that in (one-stage) cluster random sampling, I also recommended to select the clusters with probabilities proportional to their size, see Chapter \@ref(Cl). If the total of the study variable of a PSU is proportional to its size, then pps sampling leads to more precise estimates compared to simple random sampling of PSUs. Also, with pps sampling of PSUs, the estimation of means or totals and of their sampling variances is much simpler compared to selection with equal probabilities. Implementation of selection with probabilities proportional to size is easiest when units are replaced (pps with replacement, ppswr)\index{pps sampling!with replacement (ppswr)}. This implies that a PSU might be selected more than once, especially if the total number of PSUs in the population is small compared to the number of PSU draws (large sampling fraction in first stage).
Using a list as a sampling frame, the following algorithm can be used to select $n$ times a PSU by ppswr from a total of $N$ PSUs in the population:
1. Select randomly one SSU from the list with $M=\sum_{j=1}^N M_j$ SSUs ($M_j$ is the number of SSUs of PSU $j$), and determine the PSU of the selected SSU.
2. Repeat step 1 until $n$ selections have been made.
In the first stage, an SSU is selected in order to select a PSU. This may seem unnecessarily complicated. The reason for this is that this procedure automatically adjusts for the size of the PSUs (number of SSUs within a PSU), i.e., a PSU is selected with probability proportional to its size. In the second stage, a pre-determined number of SSUs, $m_{j}$, is selected every time PSU $j$ is selected.
Note that the SSU selected in the first step of the two algorithms primarily serves to identify the PSU, but these SSUs can also be used as selected SSUs.
The selection of a two-stage cluster random sample is illustrated again with Voorst. Twenty-four 0.5 km squares are constructed that serve as PSUs.
```{block2, type='rmdnote'}
Due to built-up areas, roads, etc., the PSUs in Voorst have unequal size, i.e., the number of SSUs (points, in our case) within the PSUs varies among the PSUs.
```
```{r}
cell_size <- 25
w <- 500 #width of zones
grdVoorst <- grdVoorst %>%
mutate(zone_s1 = s1 %>% findInterval(min(s1) + 1:11 * w + 0.5 * cell_size),
zone_s2 = s2 %>% findInterval(min(s2) + w + 0.5 * cell_size),
psu = str_c(zone_s1, zone_s2, sep = "_"))
```
In the next code chunk, a function is defined to select a two-stage cluster random sample from an infinite population, discretised by a finite number of points, being the centres of grid cells.
```{r}
twostage <- function(sframe, psu, n, m) {
units <- sample(nrow(sframe), size = n, replace = TRUE)
mypsusample <- sframe$psu[units]
ssunits <- NULL
for (i in seq_len(length(mypsusample))) {
ssunit <- sample(
x = which(sframe$psu == mypsusample[i]), size = m, replace = TRUE)
ssunits <- c(ssunits, ssunit)
}
psudraw <- rep(c(1:n), each = m)
mysample <- data.frame(ssunits, sframe[ssunits, ], psudraw)
mysample
}
```
Note that both the PSUs and the SSUs are selected with replacement. If a grid cell centre is selected, one point is selected fully randomly from that grid cell. This is done by shifting the centre of the grid cell to a random point within the selected grid cell with function `jitter`, see code chunk hereafter. In every grid cell, there is an infinite number of points, so we must select the grid cell centres with replacement. If a grid cell is selected more than once, more than one point is selected from the associated grid cell. Column `psudraw` in the output data frame of function `twostage` is needed in estimation because PSUs are selected with replacement. In case a PSU is selected more than once, multiple estimates of the mean of that PSU are used in estimation, see next section.
In the next code chunk, function `twostage` is used to select four times a PSU ($n=4$), with probabilities proportional to size and with replacement (ppswr). The second stage sample size equals 10 for all PSUs ($m_j=10,\; j = 1, \dots, N$). These SSUs are selected by simple random sampling.
```{r, eval = FALSE, echo = FALSE}
n <- 4
m <- 10
set.seed(314)
mysample <- twostage(sframe = grdVoorst, psu = "psu", n = n, m = m)
cell_size <- 25
mysample$s1 <- jitter(mysample$s1, amount = cell_size / 2)
mysample$s2 <- jitter(mysample$s2, amount = cell_size / 2)
```
```{r}
n <- 4
m <- 10
cell_size <- 25
set.seed(314)
mysample <- grdVoorst %>%
twostage(psu = "psu", n = n, m = m) %>%
mutate(s1 = s1 %>% jitter(amount = cell_size / 2),
s2 = s2 %>% jitter(amount = cell_size / 2))
```
Figure \@ref(fig:TwostageVoorst) shows the selected sample.
```{r TwostageVoorst, echo = FALSE, out.width = '100%', fig.cap = "Two-stage cluster random sample from Voorst. PSUs are 0.5 km squares, built-up areas, roads, etc. excluded. Four times a PSU is selected by ppswr. Each time a PSU is selected, ten SSUs (points) are selected from that PSU by simple random sampling."}
ggplot(data = grdVoorst, mapping = aes(x = s1 / 1000, y = s2 / 1000, fill = as.factor(psu))) +
geom_raster() +
scale_fill_viridis_d(alpha = 0.5) +
geom_point(data = mysample, size = 1.5) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed() +
theme(legend.position = "none") +
theme(panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.background = element_blank())
```
## Estimation of population parameters {#twostagesamplingestimators}
The population total can be estimated by substituting the estimated cluster (PSU) totals in Equation \@ref(eq:EstTotalCl). This yields the following estimator for the population total:
\begin{equation}
\hat{t}(z) = \frac{M}{n} \sum_{j \in \mathcal{S}} \frac{\hat{t}_{j}(z)}{M_{j}} = \frac{M}{n} \sum_{j \in \mathcal{S}} \hat{\bar{z}}_{j} \;,
(\#eq:EstTotalTwostage)
\end{equation}
where $n$ is the number of PSU selections and $M_{j}$ is the total number of SSUs in PSU $j$. This shows that the mean of cluster $j$, $\bar{z}_j$, is replaced by the estimated mean of PSU $j$, $\hat{\bar{z}}_j$. Dividing this estimator by the total number of population units $M$ gives the pwr estimator of the population mean:
\begin{equation}
\hat{\bar{\bar{z}}}=
\frac{1}{n}\sum\limits_{j \in \mathcal{S}}\hat{\bar{z}}_{j} \;,
(\#eq:EstMeanTwostage)
\end{equation}
with $\hat{\bar{z}}_{j}$ the estimated mean of the PSU $j$. With simple random sampling of SSUs, this mean can be estimated by the sample mean of this PSU. Note the two bars in $\hat{\bar{\bar{z}}}$, indicating that the population mean is estimated as the mean of estimated PSU means. When $m_j$ is equal for all PSUs, the sampling design is self-weighting\index{Self-weighting sampling design}, i.e., the average of $z$ over all selected SSUs is an unbiased estimator of the population mean.
For an infinite population of points, the population total is estimated by multiplying the estimated population mean (Equation \@ref(eq:EstMeanTwostage)) by the area of the study area.
The sampling variance of the estimator of the mean with two-stage cluster random sampling, PSUs selected with probabilities proportional to size with replacement, SSUs selected by simple random sampling, with replacement in case of finite populations, and $m_j = m, \; j = 1, \dots, N$, is equal to (@coc77, equation (11.33)^[Equation (11.33) in @coc77 is the variance estimator for the estimator of the population total. In Exercise 5 you are asked to derive the variance estimator for the estimator of the population mean from this variance estimator.])
\begin{equation}
V(\hat{\bar{\bar{z}}}) = \frac{S^2_{\mathrm{b}}}{n} + \frac{S^2_{\mathrm{w}}}{n\;m} \;,
(\#eq:TrueVarEstMeanTwostage)
\end{equation}
with
\begin{equation}
S^2_{\mathrm{b}}=\sum_{j=1}^N p_j\left(\bar{z}_j-\bar{z}\right)^2
(\#eq:PooledBetweenClusterVariance)
\end{equation}
and
\begin{equation}
S^2_{\mathrm{w}}=\sum_{j=1}^N p_j S^2_j \;,
(\#eq:PooledWithinClusterVariance)
\end{equation}
with $N$ the total number of PSUs in the population, $p_j=M_j/M$ the draw-by-draw selection probability of PSU $j$\index{Draw-by-draw selection probability!of a primary sampling unit}, $\bar{z}_j$ the mean of PSU $j$, $\bar{z}$ the population mean of $z$, and $S^2_j$ the variance of $z$ within PSU $j$:
\begin{equation}
S^2_j = \frac{1}{M_j} \sum_{k=1}^{M_j} (z_{kj}-\bar{z}_j)^2 \;.
(\#eq:WithinClusterVariance)
\end{equation}
```{block2, type='rmdnote'}
The first term of Equation \@ref(eq:TrueVarEstMeanTwostage) is equal to the variance of Equation \@ref(eq:TrueVarEstMeanCl). This variance component accounts for the variance of the true PSU means within the population. The second variance component quantifies our additional uncertainty about the population mean, as we do not observe all SSUs of the selected PSUs, but only a subset (sample) of these units.
```
The sampling variance of the estimator of the population mean can simply be estimated by
\begin{equation}
\widehat{V}\!\left(\hat{\bar{\bar{z}}}\right)=\frac{\widehat{S^2}(\hat{\bar{z}})}{n} \;,
(\#eq:VarEstMeanTwostage)
\end{equation}
with $\widehat{S^2}(\hat{\bar{z}})$ the estimated variance of the *estimated* PSU means:
\begin{equation}
\widehat{S^2}(\hat{\bar{z}}) = \frac{1}{n-1}\sum_{j \in \mathcal{S}}(\hat{\bar{z}}_{j}-\hat{\bar{\bar{z}}})^2 \;,
(\#eq:S2psuMeans)
\end{equation}
with $\hat{\bar{z}}_{j}$ the estimated mean of PSU $j$ and $\hat{\bar{\bar{z}}}$ the estimated population mean (Equation \@ref(eq:EstMeanTwostage)).
```{block2, type='rmdnote'}
Neither the sizes of the PSUs, $M_j$, nor the secondary sample sizes $m_{j}$ occur in Equations \@ref(eq:VarEstMeanTwostage) and \@ref(eq:S2psuMeans). This simplicity is due to the fact that the PSUs are selected with replacement and with probabilities proportional to their size. The effect of the secondary sample sizes on the variance is implicitly accounted for. To understand this, note that the larger $m_{j}$, the less variable $\hat{\bar{z}}_{j}$, and the smaller its contribution to the variance.
```
Let us assume a linear model for the total costs: $C = c_0 + c_1n + c_2nm$, with $c_0$ the fixed costs, $c_1$ the costs per PSU, and $c_2$ the costs per SSU. We want to minimise the total costs, under the constraint that the variance of the estimator of the population mean may not exceed $V_{\mathrm{max}}$. The total costs can then be minimised by selecting [@gru06]
\begin{equation}
n=\frac{1}{V_{\mathrm{max}}}\left(S_{\mathrm{w}}S_{\mathrm{b}}\sqrt{\frac{c_2}{c_1}}+S^2_{\mathrm{b}}\right)
(\#eq:nopt)
\end{equation}
PSUs and
\begin{equation}
m=\frac{S_{\mathrm{w}}}{S_{\mathrm{b}}}\sqrt{\frac{c_1}{c_2}}
(\#eq:mopt)
\end{equation}
SSUs per PSU.
Conversely, given a budget $C_{\mathrm{max}}$, the optimal number of PSU selections\index{Optimal sample size in two-stage cluster random sampling} can be computed with [@gru06]
\begin{equation}
n=\frac{C_{\mathrm{max}}S_{\mathrm{b}}}{S_{\mathrm{w}}\sqrt{c_1c_2}+S_{\mathrm{b}}c_1}\;,
(\#eq:nopt2)
\end{equation}
and $m$ as above.
In **R** the population mean and the sampling variance of the estimator of the mean can be estimated as follows.
```{r, echo = FALSE, eval = FALSE}
mz_psu <- tapply(mysample$z, INDEX = mysample$psudraw, FUN = mean)
mz <- mean(mz_psu)
se_mz <- sqrt(var(mz_psu) / n)
```
```{r}
est <- mysample %>%
group_by(psudraw) %>%
summarise(mz_psu = mean(z)) %>%
summarise(mz = mean(mz_psu),
se_mz = sqrt(var(mz_psu) / n()))
```
The estimated mean equals `r formatC(est$mz, 1, format = "f")` g kg^-1^, and the estimated standard error equals `r formatC(est$se_mz, 1, format = "f")` g kg^-1^. The sampling design is self-weighting, and so the estimated mean is equal to the sample mean.
```{r}
print(mean(mysample$z))
```
The same estimate is obtained with functions `svydesign` and `svymean` of package **survey** [@Lumley2020]. The estimator of the population total can be written as a weighted sum of the observations with all weights equal to $M/(n\;m)$. These weights are passed to function `svydesign` with argument `weight`.
```{r}
library(survey)
M <- nrow(grdVoorst)
mysample$weights <- M / (n * m)
design_2stage <- svydesign(
id = ~ psudraw, weight = ~ weights, data = mysample)
svymean(~ z, design_2stage, deff = "replace")
```
Similar to (one-stage) cluster random sampling, the estimated design effect\index{Design effect} is much larger than 1.
A confidence interval estimate of the population mean can be computed with method `confint`. The number of degrees of freedom equals the number of PSU draws minus 1.
```{r}
confint(svymean(~ z, design_2stage, df = degf(design_2stage), level = 0.95))
```
Figure \@ref(fig:SamplingDistributionTwostage) shows the approximated sampling distribution of the pwr estimator of the mean soil organic matter (SOM) concentration with two-stage cluster random sampling and of the $\pi$ estimator with simple random sampling from Voorst, obtained by repeating the random sampling with each design and estimation 10,000 times. For simple random sampling the sample size is equal to $n \times m$.
```{r, echo = FALSE, eval = FALSE}
number_of_samples <- 10000
mz <- v_mz <- mz_SI <- numeric(length = number_of_samples)
set.seed(31415)
SI <- function(sframe, n) {
units <- sample(nrow(sframe), size = n, replace = FALSE)
mysample <- sframe[units, ]
mysample
}
for (i in 1:number_of_samples) {
mysample <- twostage(grdVoorst, "psu", n, m)
psuMeans <- tapply(mysample$z, INDEX = mysample$psudraw, FUN = mean)
mz[i] <- mean(psuMeans, na.rm = TRUE)
v_mz[i] <- var(psuMeans, na.rm = TRUE) / n
mySIsample <- SI(grdVoorst, n = n * m)
mz_SI[i] <- mean(mySIsample$z)
}
save(mz, mz_SI, v_mz, file = "results/Twostage_Voorst.rda")
```
(ref:SamplingDistributionTwostagelabel) Approximated sampling distribution of the pwr estimator of the mean SOM concentration (g kg^-1^) in Voorst with two-stage cluster random sampling (Twostage) and of the $\pi$ estimator with simple random sampling (SI). The sample size with both sampling designs is 40. In two-stage sampling, four times a PSU is selected by ppswr and ten SSUs (points) are selected per PSU draw by simple random sampling.
```{r SamplingDistributionTwostage, echo = FALSE, fig.width=5, fig.asp = .8, fig.cap = "(ref:SamplingDistributionTwostagelabel)"}
load(file = "results/Twostage_Voorst.rda")
estimates <- data.frame(mz, mz_SI)
names(estimates) <- c("Twostage", "SI")
df <- estimates %>% pivot_longer(cols = c("Twostage", "SI"))
df$name <- factor(df$name, levels = c("Twostage", "SI"), ordered = TRUE)
ggplot(data = df) +
geom_boxplot(aes(y = value, x = name)) +
geom_hline(yintercept = mean(grdVoorst$z), colour = "red") +
scale_x_discrete(name = "Sampling design") +
scale_y_continuous(name = "Estimated mean SOM")
v_mz_sim <- var(mz)
m_vmz_sim <- mean(v_mz)
v_mz_SI_sim <- var(mz_SI)
```
The variance of the 10,000 means with two-stage cluster random sampling equals `r formatC(var(mz), 1, format = "f")` (g kg^-1^)^2^. This is considerably larger than with simple random sampling: `r formatC(var(mz_SI), 1, format = "f")` (g kg^-1^)^2^. The average of the estimated variances with two-stage cluster random sampling equals `r formatC(mean(v_mz), 1, format = "f")` (g kg^-1^)^2^.
Optimal sample sizes for two-stage cluster random sampling (ppswr in first stage, simple random sampling without replacement in second stage) can be computed with function `clusOpt2` of **R** package **PracTools** (@PracTools, @Vaillant2018). This function requires as input various variance measures, which can be computed with function `BW2stagePPS`, in case the study variable is known for the whole population or estimated from a sample with function `BW2stagePPSe`. This is left as an exercise (Exercise 5).
#### Exercises {-}
1. Write an **R** script to compute for Voorst the true sampling variance of the estimator of the mean SOM concentration for two-stage cluster random sampling, PSUs selected by ppswr, $n=4$, and $m=10$, see Equation \@ref(eq:TrueVarEstMeanTwostage).
2. Do you expect that the standard error of the estimator of the population mean with ten PSU draws ($n=10$) and four SSUs per PSU draw ($m=4$) is larger or smaller than with four PSU draws ($n=4$) and ten SSUs per PSU draw ($m=10$)?
3. Compute the optimal sample sizes $n$ and $m$ for a maximum variance of the estimator of the mean SOM concentration of 1, $c_1=2$ monetary units, and $c_2=1$ monetary unit, see Equations \@ref(eq:nopt) and \@ref(eq:mopt).
4. Compute the optimal sample sizes $n$ and $m$ for a budget of 100 monetary units, $c_1=2$ monetary units, and $c_2=1$ monetary unit, see Equations \@ref(eq:nopt2) and \@ref(eq:mopt).
5. Use function `clusOpt2` of **R** package **PracTools** to compute optimal sample sizes given the precision requirement for the estimated population mean of Exercise 3 and given the budget of Exercise 4. Use function `BW2stagePPS` to compute the variance measures needed as input for function `clusOpt2`. Note that the precision requirement of function `clusOpt2` is the coefficient of variation of the estimated population total, i.e., the standard deviation of the estimated population total divided by the population total. Compute this coefficient of variation from the maximum variance of the estimator of the population mean used in Exercise 3.
6. The variance of the estimator for the population total is [@coc77]:
\begin{equation}
V(\hat{t}(z)) = \frac{1}{n} \sum_{j=1}^N p_j\left(\frac{t_j(z)}{p_j}-t(z)\right)^2 + \frac{1}{n} \sum_{j=1}^N \frac{M_j^2 (1-f_{2j})S^2_j}{m_j p_j} \;,
(\#eq:EstVarTotalTwostageCochran)
\end{equation}
with $\hat{t}(z)$ and $t(z)$ the estimated and the true population total of $z$, respectively, $t_j(z)$ the total of PSU $j$, and $p_j = M_j/M$. Use $m_j = m, \;j = 1, \dots, N$, and $f_{2j}=0$, i.e., sampling from infinite population, or sampling of SSUs within PSUs by simple random sampling *with* replacement from finite population. Derive the variance of the estimator for the population mean, Equation \@ref(eq:TrueVarEstMeanTwostage), from Equation \@ref(eq:EstVarTotalTwostageCochran).
## Primary sampling units selected without replacement
Similar to cluster random sampling, we may prefer to select the PSUs without replacement\index{pps sampling!without replacement (ppswor)}. This leads to less strong spatial clustering of the sampling points, especially with large sampling fractions of PSUs. Sampling without replacement of PSUs can be done with function `UPpivotal` of package **sampling** [@Tille2016], see Subsection \@ref(pivotalmethod). The second stage sample of SSUs is selected with function `strata` of the same package, using the PSUs as strata.
```{r}
library(sampling)
M_psu <- tapply(grdVoorst$z, INDEX = grdVoorst$psu, FUN = length)
n <- 6; m <- 10
pi <- inclusionprobabilities(M_psu, n)
set.seed(314)
sampleind <- UPpivotal(pik = pi, eps = 1e-6)
psus <- sort(unique(grdVoorst$psu))
sampledpsus <- psus[sampleind == 1]
mysample_stage1 <- grdVoorst[grdVoorst$psu %in% sampledpsus, ]
units <- sampling::strata(mysample_stage1, stratanames = "psu",
size = rep(m, n), method = "srswor")
mysample <- getdata(mysample_stage1, units)
mysample$ssunits <- units$ID_unit
mysample$pi <- n * m / M
print(mean_HT <- sum(mysample$z / mysample$pi) / M)
```
The population mean can be estimated with function `svymean` of package **survey**. To estimate the variance, a simple solution is to treat the two-stage cluster random sample as a pps sample *with replacement*, so that variance can be estimated with Equation \@ref(eq:VarEstMeanTwostage). With small sampling fractions of PSUs, the overestimation of the variance is negligible. With larger sampling fractions, Brewer's method is recommended, see @Berger2004 (option 2)\index{Brewer's variance estimator}.
```{r}
mysample$fpc1 <- n * M_psu[mysample$psu] / M
mysample$fpc2 <- m / M_psu[mysample$psu]
design_2stageppswor <- svydesign(id = ~ psu + ssunits, data = mysample,
pps = "brewer", fpc = ~ fpc1 + fpc2)
svymean(~ z, design_2stageppswor)
```
## Simple random sampling of primary sampling units {#TwostageSISI}
Suppose the PSUs are for some reason not selected with probabilities proportional to their size, but by simple random sampling without replacement. The inclusion probabilities of the PSUs then equal $\pi_j=n/N,\; j = 1, \dots, N$, and the population total can be estimated by (compare with Equation \@ref(eq:EstTotalClEqual))
\begin{equation}
\hat{t}(z) = \sum_{j=1}^n \frac{\hat{t}_j(z)}{\pi_j} = \frac{N}{n} \sum_{j=1}^n \hat{t}_j(z)\;,
(\#eq:EstTotalTwostageEqual)
\end{equation}
with $\hat{t}_j(z)$ an estimator of the total of PSU $j$. The population mean can be estimated by dividing this estimator by the population size $M$.
Alternatively, we may estimate the population mean by dividing the estimate of the population total by the *estimated* population size. The population size can be estimated by the $\pi$ estimator, see Equation \@ref(eq:EstPopulatonSizeClEqual). The $\pi$ estimator and the ratio estimator are equal when the PSUs are selected by ppswr, but not so when the PSUs of different size are selected with equal probabilities. This is shown below. First, a sample is selected by selecting both PSUs and SSUs by simple random sampling without replacement.
```{r}
library(sampling)
set.seed(314)
psus <- sort(unique(grdVoorst$psu))
ids_psu <- sample(length(psus), size = n, replace = FALSE)
sampledpsus <- psus[ids_psu]
mysample_stage1 <- grdVoorst[grdVoorst$psu %in% sampledpsus, ]
units <- sampling::strata(mysample_stage1, stratanames = "psu",
size = rep(m, n), method = "srswor")
mysample <- getdata(mysample_stage1, units)
mysample$ssunits <- units$ID_unit
```
The population mean is estimated by the $\pi$ estimator and the ratio estimator.
```{r, echo = FALSE, eval = FALSE}
N <- length(unique(grdVoorst$psu))
M_psu <- tapply(grdVoorst$z, INDEX = grdVoorst$psu, FUN = length)
pi_psu <- n / N
pi_ssu <- m / M_psu[mysample$psu]
mysample$pi <- pi_psu * pi_ssu
z_piexpanded <- with(mysample, z / pi)
tz_HT <- sum(z_piexpanded)
mz_HT <- tz_HT / M
M_HT <- sum(1 / mysample$pi)
mz_ratio <- tz_HT / M_HT
```
```{r}
N <- length(unique(grdVoorst$psu))
M_psu <- tapply(grdVoorst$z, INDEX = grdVoorst$psu, FUN = length)
pi_psu <- n / N
pi_ssu <- m / M_psu[mysample$psu]
est <- mysample %>%
mutate(pi = pi_psu * pi_ssu,
z_piexpanded = z / pi) %>%
summarise(tz_HT = sum(z_piexpanded),
mz_HT = tz_HT / M,
M_HT = sum(1 / pi),
mz_ratio = tz_HT / M_HT)
```
The $\pi$ estimate equals `r formatC(est$mz_HT, 1, format = "f")` g kg^-1^, and the ratio estimate equals `r formatC(est$mz_ratio, 1, format = "f")` g kg^-1^. The $\pi$ estimate of the population mean can also be computed by first estimating totals of PSUs, see Equation \@ref(eq:EstTotalTwostageEqual).
```{r}
tz_psu <- tapply(mysample$z / pi_ssu, INDEX = mysample$psu, FUN = sum)
tz_HT <- sum(tz_psu / pi_psu)
(mz_HT <- tz_HT / M)
```
The variance of the $\pi$ estimator of the population mean can be estimated by first estimating the variance of the estimator of the PSU totals:
\begin{equation}
\widehat{V}(\hat{t}(z)) = N^2\left(1-\frac{n}{N}\right)\frac{\widehat{S^2}(\hat{t}_i(z))}{n} \;,
(\#eq:EstVarTotalHTTwostageSI)
\end{equation}
and dividing this variance by the squared number of population units:
\begin{equation}
\widehat{V}(\hat{\bar{\bar{z}}}) = \frac{1}{M^2} \widehat{V}(\hat{t}(z)) \;,
(\#eq:EstVarMeanHTTwostageSI)
\end{equation}
as shown in the code chunk below (the final line computes the standard error).
```{r}
fpc <- 1 - n / N
v_tz <- N^2 * fpc * var(tz_psu) / n
(se_mz_HT <- sqrt(v_tz / M^2))
```
The ratio estimator of the population mean and its standard error can be computed with function `svymean` of package **survey**.
```{r}
mysample$fpc1 <- N
mysample$fpc2 <- M_psu[mysample$psu]
design_2stage <- svydesign(
id = ~ psu + ssunits, fpc = ~ fpc1 + fpc2, data = mysample)
svymean(~ z, design_2stage)
```
The estimated standard error of the ratio estimator is slightly smaller than the standard error of the $\pi$ estimator.
## Stratified two-stage cluster random sampling {#StratifiedTwostage}
The basic sampling designs stratified random sampling (Chapter \@ref(STSI)) and two-stage cluster random sampling can be combined into stratified two-stage cluster random sampling\index{Stratified random sampling!stratified two-stage cluster random sampling}. Figure \@ref(fig:STtwostage) shows a stratified two-stage cluster random sample from Voorst. The strata are groups of eight PSUs within 2 km $\times$ 1 km blocks, as before in stratified cluster random sampling (Figure \@ref(fig:ClVoorst)). The PSUs are 0.5 km squares (built-up areas, roads, etc. excluded), as before in (unstratified) two-stage cluster random sampling (Figure \@ref(fig:TwostageVoorst)). Within each stratum two times a PSU is selected by ppswr, and every time a PSU is selected, six SSUs (points) are selected by simple random sampling. The stratification avoids the clustering of the selected PSUs in one part of the study area. Compared to (unstratified) two-stage cluster random sampling, the geographical spreading of the PSUs is somewhat improved, which may lead to an increase of the precision of the estimated population mean.
```{r, echo = FALSE}
w <- 1000 #width of strata
s1bnd <- seq(min(grdVoorst$s1) + w, min(grdVoorst$s1) + 5 * w, w) + 12.5
grdVoorst$zone_stratum <- as.factor(findInterval(grdVoorst$s1, s1bnd))
levels(grdVoorst$zone_stratum) <- rep(c("a", "b", "c"), each = 2)
```
```{r}
n_h <- rep(2, 3)
m <- 6
set.seed(314)
stratumlabels <- unique(grdVoorst$zone_stratum)
mysample <- NULL
for (i in 1:3) {
grd_h <- grdVoorst[grdVoorst$zone_stratum == stratumlabels[i], ]
mysample_h <- twostage(sframe = grd_h, psu = "psu", n = n_h[i], m = m)
mysample <- rbind(mysample, mysample_h)
}
mysample$s1 <- jitter(mysample$s1, amount = cell_size / 2)
mysample$s2 <- jitter(mysample$s2, amount = cell_size / 2)
```
```{r STtwostage, echo = FALSE, out.width = "100%", fig.cap = "Stratified two-stage random sample from Voorst. Strata are groups of eight PSUs (0.5 km squares) within 2 km $\\times$ 1 km blocks. From each stratum two times a PSU is selected by ppswr, and six SSUs (points) are selected per PSU draw by simple random sampling."}
s1bnd <- seq(
from = min(grdVoorst$s1) + w,
to = min(grdVoorst$s1) + (11 * w),
by = w) + 12.5
ggplot(grdVoorst) +
geom_raster(mapping = aes(x = s1 / 1000, y = s2 / 1000, fill = as.factor(psu))) +
scale_fill_viridis_d(alpha = 0.5) +
geom_point(data = mysample, mapping = aes(x = s1 / 1000, y = s2 / 1000), size = 1.5) +
geom_vline(xintercept = s1bnd[c(2, 4)] / 1000) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed() +
theme(legend.position = "none") +
theme(panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.background = element_blank())
```
The population mean can be estimated in much the same way as with stratified cluster random sampling. With function `svymean` this is an easy task.
```{r}
N_h <- tapply(grdVoorst$psu, INDEX = grdVoorst$zone_stratum,
FUN = function(x) {
length(unique(x))
})
M_h <- tapply(grdVoorst$z, INDEX = grdVoorst$zone_stratum, FUN = length)
mysample$w1 <- N_h[mysample$zone_stratum]
mysample$w2 <- M_h[mysample$zone_stratum]
design_str2stage <- svydesign(id = ~ psudraw, strata = ~ zone_stratum,
weights = ~ w1 + w2, data = mysample, nest = TRUE)
svymean(~ z, design_str2stage)
```
```{r, echo = FALSE}
rm(list = ls())
```