Chapter 3 Sampling Measurable Variables
3.1 Simple Random Sampling
Most foresters are familiar with simple random sampling. As in any sampling system, the aim is to estimate some characteristic of a population without measuring all of the population units. In a simple random sample of size \(n\), the units are selected so that every possible combination of \(n\) units has an equal chance of being selected. If sampling is with replacement, then at each stage of the sampling all units should have an equal chance of being selected. If sampling is without replacement, then at any stage of the sampling each unused unit should have an equal chance of being selected.
Sample estimates of the population mean and total.–From a population of \(N=100\) units, \(n=20\) units were selected at random and measured. Sampling was without replacement–once a unit had been included in the sample it could not be selected again. The unit values were:
10 9 10 9 11
16 11 7 12 12
11 3 5 11 14
8 13 12 20 10
Sum of all 20 random units = 214
From this sample we estimate the population mean as: \[\bar X=\frac{\sum X}{n}=\frac{214}{20}=10.7\]
A population of \(N=100\) units having a mean of 10.7 would then have an estimated total of: \[\hat T=N\bar X=100(10.7)=1,070\]
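The arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation in the text; the variable names (`sample`, `mean`, `total`) are chosen here for illustration.

```python
# Sample values from the example (n = 20 units drawn from N = 100).
sample = [10, 9, 10, 9, 11,
          16, 11, 7, 12, 12,
          11, 3, 5, 11, 14,
          8, 13, 12, 20, 10]
N = 100  # number of units in the population

n = len(sample)
mean = sum(sample) / n   # estimated population mean
total = N * mean         # estimated population total

print(f"n = {n}, sum = {sum(sample)}")
print(f"estimated mean  = {mean:.1f}")
print(f"estimated total = {total:,.0f}")
```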
3.1.1 Standard Errors
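The first step in calculating a standard error is to obtain an estimate of the population variance \((\sigma^2)\) or standard deviation \((\sigma)\). As noted in a previous section, the standard deviation for a simple random sample is estimated by: \[s=\sqrt{{\sum X^2-{\frac{(\sum X)^2}{n}}} \over n-1}=\sqrt {{10^2+16^2+...+10^2-\frac{214^2}{20}} \over 19}=\sqrt {13.4842}=3.672\]
For sampling without replacement, the standard error of the mean is then: \[s_\bar x=\sqrt{\frac{s^2}{n}(1-\frac{n}{N})}=\sqrt{\frac{13.4842}{20}(1-\frac{20}{100})}=0.734\]
The term \((1-\frac{n}{N})\) is the finite population correction (fpc).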
From the formula for the variance of a linear function we find that the variance of the estimated total is: \[s^2_\hat T=N^2s^2_\bar x\]
The standard error of the estimated total is the square root of this, or: \[s_\hat T=Ns_\bar x=100(0.734)=73.4\]
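The following sketch retraces the standard-error calculation for the 20-unit sample. It assumes sampling without replacement, so the standard error of the mean includes the finite population correction \((1-n/N)\); the variable names are illustrative only.

```python
import math

sample = [10, 9, 10, 9, 11,
          16, 11, 7, 12, 12,
          11, 3, 5, 11, 14,
          8, 13, 12, 20, 10]
N = 100          # population size
n = len(sample)  # sample size

sum_x = sum(sample)
sum_x2 = sum(x * x for x in sample)

# Sample variance by the computational formula used in the text.
variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
s = math.sqrt(variance)  # standard deviation

# Standard error of the mean, with the finite population correction.
se_mean = math.sqrt(variance / n * (1 - n / N))

# Standard error of the estimated total.
se_total = N * se_mean

print(f"variance = {variance:.4f}")
print(f"s        = {s:.3f}")
print(f"se_mean  = {se_mean:.3f}")
print(f"se_total = {se_total:.1f}")
```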
3.1.2 Confidence Limits
We have it on good authority that “you can fool all of the people some of the time.” The oldest and simplest device for misleading folks is the barefaced lie. A method that is nearly as effective and far more subtle is to report a sample estimate without any indication of its reliability.
Sample estimates are subject to variation. How much they vary depends primarily on inherent variability of the population \((\sigma^2)\) and on the size of the sample \((n)\) and of the population \((N)\).
The statistical way of indicating the reliability of an estimate is to establish confidence limits. For estimates made from normally distributed populations, the confidence limits are given by: \[\text {Estimate} \pm (t)(\text {Standard Error})\]
For setting confidence limits on the mean and total we already have everything we need except for the value of \(t\), and that can be obtained from the table of the \(t\) distribution (table 2 in the appendix). In this table, the column headed df (degrees of freedom) refers to the size of the sample. For the mean (or total) of a simple random sample we would select a \(t\) value with \((n-1)\) degrees of freedom. The columns labeled Probability refer to the kind of odds we demand. If we want to say that the true mean (or total) falls within certain limits unless a one-in-twenty chance has occurred, we use the \(t\) value in the column headed 0.05. If we want to say that the true value lies within a set of limits unless a one-in-one hundred chance has occurred, we select \(t\) from the column headed 0.01.
In the previous example the sample of \(n=20\) units had a mean of \(\bar X= 10.7\) and a standard error of \(s_\bar x=0.734\). For 95-percent confidence limits on the mean we would use a \(t\) value from the 0.05 column and the row for 19 degrees of freedom (df). As \(t_{0.05}=2.093\), the confidence limits are given by: \[\bar X \pm (t)(s_\bar x)=10.7 \pm (2.093)(0.734)=9.16 \text { to } 12.24\]
This says that unless a one-in-twenty chance has occurred in sampling, the population mean is somewhere between 9.16 and 12.24. It does not say where the mean of future samples from this population might fall. Nor does it say where the mean may be if mistakes have been made in the measurements.
For 99-percent confidence limits we find \(t_{0.01}=2.861\) (with 19 degrees of freedom), and so the limits are: \[10.7 \pm (2.861)(0.734)=8.6 \text { to }12.8\]
These limits are wider, but they are more likely to include the true population mean.
For the population total the confidence limits are:
- 95-percent limits – \(1,070 \pm (2.093)(73.4)=916 \text { to } 1,224\)
- 99-percent limits – \(1,070 \pm (2.861)(73.4)=860 \text { to } 1,280\)
For large samples \((n>60)\) the 95-percent limits are closely approximated by: \[\text {Estimate} \pm (2)(\text {Standard Error})\]
and the 99-percent limits by: \[\text {Estimate} \pm (2.6)(\text {Standard Error})\]
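The confidence limits worked out in this section can be reproduced with a small helper. The function name `confidence_limits` is hypothetical, and the \(t\) values are simply those quoted in the text for 19 degrees of freedom rather than values computed by the code.

```python
def confidence_limits(estimate, std_error, t):
    """Return (lower, upper) limits: estimate +/- t * standard error."""
    half_width = t * std_error
    return estimate - half_width, estimate + half_width

# Values taken from the worked example.
mean, se_mean = 10.7, 0.734
total, se_total = 1070.0, 73.4

# Tabled t values for 19 degrees of freedom, as quoted in the text.
t_values = {"95-percent": 2.093, "99-percent": 2.861}

for label, t in t_values.items():
    lo_m, hi_m = confidence_limits(mean, se_mean, t)
    lo_t, hi_t = confidence_limits(total, se_total, t)
    print(f"{label} limits on the mean:  {lo_m:.2f} to {hi_m:.2f}")
    print(f"{label} limits on the total: {lo_t:.0f} to {hi_t:.0f}")
```

For a large sample the same function could be called with \(t=2\) or \(t=2.6\), following the approximations given above.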
3.1.3 Sample size
Samples cost money. So do errors. The aim in planning a survey should be to take enough observations to obtain the desired precision – no more, no less.
The number of observations needed in a simple random sample will depend on the precision desired and the inherent variability of the population being sampled. Since sampling precision is often expressed in terms of the confidence interval on the mean, it is not unreasonable in planning a survey to say that in the computed confidence interval:
\[\bar X \pm ts_\bar x\]
we would like to have the \(ts_\bar x\) equal to or less than some specified value \(E\), unless a one-in-twenty (or one-in-one hundred) chance has occurred in sampling. That is, we want:
\[ts_\bar x=E\]
or since \(s_\bar x=\frac{s}{\sqrt n}\), we want:
\[t(\frac{s}{\sqrt n})=E\]
Solving this for \(n\) gives the desired sample size.
\[n=\frac{t^2s^2}{E^2}\]
To apply this equation we need to have an estimate \((s^2)\) of the population variance and a value for Student's \(t\) at the appropriate level of probability.
The variance estimate can be a real problem. One solution is to make the sample survey in two stages. In the first stage, \(n_1\) random observations are made and from these an estimate \((s^2)\) of the variance is computed. Then this value is plugged into the sample size equation:
\[n=\frac{t^2s^2}{E^2}\]
where:
- \(t\) has \(n_1-1\) degrees of freedom and is selected from table 2 of the appendix.
The computed value of \(n\) is the total size of sample needed. As we have already observed \(n_1\) units, this means that we will have to observe \((n-n_1)\) additional units.
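A minimal sketch of the two-stage procedure follows. The numbers are assumptions made for illustration: the first-stage variance and \(t\) value are borrowed from the earlier 20-unit example, the allowable error \(E=1.0\) is invented, and any adjustment for a finite population is ignored here.

```python
import math

def sample_size(t, s2, E):
    """Sample size from n = t^2 * s^2 / E^2 (not yet rounded)."""
    return t ** 2 * s2 / E ** 2

# Hypothetical first stage: n1 preliminary observations.
n1 = 20
s2 = 13.4842   # variance estimated from the first-stage sample
t = 2.093      # tabled t for n1 - 1 = 19 df at the 0.05 level
E = 1.0        # assumed allowable error, same units as the data

n_exact = sample_size(t, s2, E)
n_needed = math.ceil(n_exact)         # round up to whole units
additional = max(n_needed - n1, 0)    # units still to be observed

print(f"computed n = {n_exact:.2f}, so take {n_needed} units")
print(f"additional units needed after the first stage: {additional}")
```

Rounding up rather than to the nearest unit is a conservative choice, since rounding down would leave the expected error slightly above \(E\).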
If pre-sampling as described above is not feasible then it will be necessary to make a guess at the variance. Assuming our knowledge of the population is such that the guessed variance (\(s^2\)) can be considered fairly reliable, then the size of sample (\(n\)) needed to estimate the mean to within \(\pm E\) units is approximately: \[n=\frac {4s^2}{E^2}\quad\text{for 95 percent confidence}\]
and \[n=\frac{20s^2}{3E^2}\quad\text{for 99 percent confidence.}\]
Less reliable variance estimates could be doubled (as a safety factor) before applying these equations. In many cases the variance estimate may be so poor as to make sample size computation just so much statistical window dressing.
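The two approximations, and the effect of doubling a shaky variance guess, might be sketched as follows. The guessed variance and the error limit are hypothetical values chosen only to give round numbers.

```python
import math

def n_for_95(s2, E):
    """Approximate sample size for 95 percent confidence: 4 s^2 / E^2."""
    return 4 * s2 / E ** 2

def n_for_99(s2, E):
    """Approximate sample size for 99 percent confidence: 20 s^2 / (3 E^2)."""
    return 20 * s2 / (3 * E ** 2)

s2_guess = 13.5   # hypothetical guessed variance
E = 1.5           # hypothetical allowable error

print("reliable guess: ",
      math.ceil(n_for_95(s2_guess, E)),
      math.ceil(n_for_99(s2_guess, E)))

# Safety factor for a less reliable guess: double the variance.
print("doubled variance:",
      math.ceil(n_for_95(2 * s2_guess, E)),
      math.ceil(n_for_99(2 * s2_guess, E)))
```

Because \(n\) is proportional to \(s^2\), doubling the variance doubles the computed sample size.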
When sampling is without replacement (as it is in most forest sampling situations) the sample size estimates given above apply to populations with an extremely large number (\(N\)) of units so that the sampling fraction (\(n/N\)) is very small. If the sampling fraction is not small (say \(\frac{n}{N} \ge 0.05\)) then the sample size estimates should be adjusted. This adjusted value of \(n\) is: \[n_a={n \over 1 + \frac{n}{N}}\]
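As a sketch of the adjustment, suppose the unadjusted formula called for 60 units from a population of only 100. Both numbers are hypothetical, and the function name is chosen here for illustration.

```python
import math

def adjusted_sample_size(n, N):
    """Adjust n for a finite population: n_a = n / (1 + n / N)."""
    return n / (1 + n / N)

n = 60    # hypothetical unadjusted sample size
N = 100   # hypothetical population size

n_a = adjusted_sample_size(n, N)
print(f"sampling fraction n/N = {n / N:.2f}")
print(f"adjusted sample size  = {n_a:.1f}, so take {math.ceil(n_a)} units")
```

When \(N\) is very large relative to \(n\), the divisor approaches 1 and the adjustment has almost no effect.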
Warning! It is important that the specified error \((E)\) and the estimated variance (\(s^2\)) be on the same scale of measurement. We could not, for example, use a board-foot variance in conjunction with an error expressed in cubic feet. Similarly, if the error is expressed in volume per acre, the variance must be put on a per-acre basis. Suppose that we plan to use quarter-acre plots in a survey and estimate the variance among plot volumes to be \(s^2=160,000\). If the error limit is \(E=500\) feet per acre, we must convert the variance to an acre basis or the error to a quarter-acre basis. To convert a quarter-acre volume to a per-acre basis we multiply by 4, and to convert a quarter-acre variance to an acre variance we multiply by 16. Thus, the variance would be 2,560,000 and the sample size formula would be: \[n=\frac{t^2(2,560,000)}{500^2}=t^2(10.24)\]
Alternatively, we can leave the variance alone and convert the error statement from an acre to a quarter-acre basis: i.e., \(E=125\). Then the sample-size formula is:
\[n=\frac{t^2(160,000)}{(125)^2}=t^2(10.24)\] as before.
The problem of units of measure is not difficult, but the unwary can easily go astray.
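The equivalence of the two conversions can be confirmed numerically. In this sketch the multiplier of \(t^2\) in the sample size formula is computed both ways; the variable names are illustrative.

```python
plot_size_acres = 0.25
plots_per_acre = 1 / plot_size_acres   # 4 quarter-acre plots per acre

var_plot = 160_000   # variance among quarter-acre plot volumes
E_acre = 500         # allowable error on a per-acre basis

# Route 1: put the variance on a per-acre basis (multiply by 4^2 = 16).
var_acre = var_plot * plots_per_acre ** 2
coef_acre_basis = var_acre / E_acre ** 2

# Route 2: put the error on a quarter-acre basis (divide by 4).
E_plot = E_acre / plots_per_acre
coef_plot_basis = var_plot / E_plot ** 2

print(f"per-acre variance        = {var_acre:,.0f}")
print(f"quarter-acre error       = {E_plot:.0f}")
print(f"n / t^2 (acre basis)     = {coef_acre_basis:.2f}")
print(f"n / t^2 (plot basis)     = {coef_plot_basis:.2f}")
```

A variance scales with the square of the conversion factor while an error scales with the factor itself, which is why both routes give the same multiplier.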
3.2 Stratified Random Sampling
In stratified sampling, a population is divided into subpopulations (strata) of known size, and a simple random sample of at least two units is selected in each subpopulation. The approach has several advantages. For one thing, if there is more variation between subpopulations than within them, the estimate of the population mean will be more precise than that given by a simple random sample of the same size. Also, it may be desirable to have separate estimates for each subpopulation (e.g., in timber types or administrative subunits). And it may be administratively more efficient to sample by subpopulations.
Example:
A 500-acre forested area was divided into three strata on the basis of timber type. A simple random sample of 0.2-acre plots was taken in each stratum, and the means, variances, and standard errors were computed by the formulas for a simple random sample. These results, along with the size (\(N_h\)) of each stratum (expressed in number of 0.2-acre plots), are:
Type | Stratum number (\(h\)) | Stratum size (\(N_h\)) | Sample size (\(n_h\)) | Stratum mean (\(\bar X_h\)) | Within stratum variance (\(s^2_h\)) | Squared standard error of the mean (\({s_\bar x^2}_h\)) |
---|---|---|---|---|---|---|
Pine | 1 | 1350 | 30 | 251 | 10860 | 353.96 |
Upland Hardwood | 2 | 700 | 15 | 164 | 9680 | 631.50 |
Bottomland Hardwood | 3 | 450 | 10 | 110 | 3020 | 295.29 |
Sum | | 2500 | | | | |
The squared standard error of the mean for stratum \(h\) is computed by the formula given for the simple random sample: \[{s_\bar x^2}_h=\frac{s^2_h}{n_h}(1-\frac{n_h}{N_h})\]
Thus, for stratum 1 (pine type), \[{s_\bar x^2}_1=\frac{10860}{30}(1-\frac{30}{1350})=353.96\]
Where the sampling fraction (\(n_h/N_h\)) is small, the finite population correction (fpc), \((1-\frac{n_h}{N_h})\), can be omitted. With these data, the population mean is estimated by: \[\bar X_{st}=\Sigma \frac{N_h\bar X_h}{N}\]
where \(N=\Sigma N_h\)
For this example we have: \[\bar X_{st}={\frac{N_1\bar X_1+N_2\bar X_2+N_3\bar X_3}{N}}=\frac{1,350(251)+700(164)+450(110)}{2,500}=201.26\]
The formula for the standard error of the stratified mean is cumbersome but not complicated: \[s_{\bar x_{st}}=\sqrt{\frac{1}{N^2} [\sum N^2_hs^2{_\bar x}_{h}]} = \sqrt{\frac{(1350)^2(353.96)+(700)^2(631.50)+(450)^2(295.29)}{(2,500)^2}}=12.74\]
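The stratified mean and its standard error can be recomputed from the table as a check. This is a sketch using the tabled stratum values; the tuple layout and variable names are chosen here for illustration.

```python
import math

# (timber type, N_h, n_h, stratum mean, within-stratum variance)
strata = [
    ("Pine",                1350, 30, 251, 10860),
    ("Upland Hardwood",      700, 15, 164,  9680),
    ("Bottomland Hardwood",  450, 10, 110,  3020),
]

N = sum(N_h for _, N_h, _, _, _ in strata)

# Stratified mean: stratum means weighted by stratum size.
mean_st = sum(N_h * mean_h for _, N_h, _, mean_h, _ in strata) / N

def se2_mean(s2, n, N_h):
    """Squared standard error of a stratum mean, with the fpc."""
    return s2 / n * (1 - n / N_h)

# Variance of the stratified mean: sum of N_h^2 * se^2_h, over N^2.
var_st = sum(N_h ** 2 * se2_mean(s2, n_h, N_h)
             for _, N_h, n_h, _, s2 in strata) / N ** 2
se_st = math.sqrt(var_st)

for name, N_h, n_h, _, s2 in strata:
    print(f"{name:20s} squared SE = {se2_mean(s2, n_h, N_h):.2f}")
print(f"stratified mean = {mean_st:.2f}")
print(f"standard error  = {se_st:.2f}")
```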
If the sample size is fairly large, the confidence limits on the mean are given by:
95-percent confidence limits \(=\bar X_{st}\pm 2s_{\bar x_{st}}\)
99-percent confidence limits \(=\bar X_{st}\pm 2.6s_{\bar x_{st}}\)
There is no simple way of computing the confidence limits for small samples.
3.2.1 Sample allocations
If a sample of \(n\) units is taken, how many units should be selected in each stratum? Among several possibilities, the most common procedure is to allocate the sample in proportion to the size of the stratum; in a stratum having two-fifths of the units of the population we would take two-fifths of the samples. In the population discussed in the previous example the proportional allocation of the 55 sample units would have been (and was) as follows:
Stratum | Relative size (\(N_h/N\)) | Sample allocation |
---|---|---|
1 | 0.54 | 29.7 or 30 |
2 | 0.28 | 15.4 or 15 |
3 | 0.18 | 9.9 or 10 |
Sum | 1.00 | 55 |
For the proportional allocation the number of sample units to be selected in stratum \(h\) is:\[n_h=(\frac {N_h}{N})n\]
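Proportional allocation of the 55 sample units can be sketched as below. Rounding each stratum separately happens to sum to 55 here, but in general the rounded values may need a small manual adjustment to match \(n\).

```python
stratum_sizes = {1: 1350, 2: 700, 3: 450}   # N_h, in 0.2-acre plots
n = 55                                      # total sample size

N = sum(stratum_sizes.values())

# n_h = (N_h / N) * n for each stratum.
allocation = {h: n * N_h / N for h, N_h in stratum_sizes.items()}
rounded = {h: round(n_h) for h, n_h in allocation.items()}

for h in stratum_sizes:
    print(f"stratum {h}: relative size {stratum_sizes[h] / N:.2f}, "
          f"allocation {allocation[h]:.1f} or {rounded[h]}")
print(f"sum of rounded allocations = {sum(rounded.values())}")
```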
Some other possibilities are equal allocation, allocation proportional to estimated value, and optimum allocation. In optimum allocation an attempt is made to get the smallest standard error (of \(\bar X_{st}\)) possible for a sample of \(n\) units. This is done by sampling more heavily in the strata having a larger variation. The equation for optimum allocation is: \[n_h=(\frac {N_hs_h}{\sum N_hs_h})n\]
Optimum allocation obviously requires estimates of the within-stratum variances–information that may be difficult to obtain.
A refinement of optimum allocation is to take sampling cost differences into account and allocate the sample so as to get the most information per dollar. If the cost per sampling unit in stratum \(h\) is \(c_h\), the equation is: \[n_h=({\frac {N_hs_h}{\sqrt{c_h}} \over \Sigma(\frac{N_hs_h}{\sqrt{c_h}})})n\]
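Both forms of optimum allocation can be handled by one function, since the cost-free equation is the special case in which every stratum has the same cost. The sketch below uses the stratum sizes and variances from the earlier table as if they were known in advance; the function name and the relative costs are hypothetical.

```python
import math

def optimum_allocation(n, sizes, variances, costs=None):
    """Allocate n units in proportion to N_h * s_h / sqrt(c_h).

    With costs omitted, every stratum is given the same unit cost and
    the result reduces to allocation in proportion to N_h * s_h.
    """
    if costs is None:
        costs = [1.0] * len(sizes)
    weights = [N_h * math.sqrt(s2) / math.sqrt(c)
               for N_h, s2, c in zip(sizes, variances, costs)]
    total = sum(weights)
    return [n * w / total for w in weights]

sizes = [1350, 700, 450]          # N_h
variances = [10860, 9680, 3020]   # within-stratum variances s_h^2
costs = [1.0, 1.0, 4.0]           # hypothetical relative costs per unit

opt = optimum_allocation(55, sizes, variances)
with_cost = optimum_allocation(55, sizes, variances, costs)

print("optimum:       ", [round(x, 1) for x in opt])
print("cost-weighted: ", [round(x, 1) for x in with_cost])
```

Compared with proportional allocation, the more variable pine stratum receives a larger share, and making the third stratum four times as costly to sample shifts units away from it.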
3.2.2 Sample size
To estimate the size of sample to take for a specified error at a given level of confidence, it is first necessary to decide on the method of allocation. Ordinarily, proportional allocation is the simplest and perhaps the best choice. With proportional allocation, the size of sample needed to be within \(\pm E\) units of the true value at the 0.05 probability level can be approximated by: \[n={N(\sum N_hs_h^2) \over \frac{N^2E^2}{4} + \sum N_hs_h^2}\]
For the 0.01 probability level, use 6.76 in place of 4.
To illustrate, assume that prior to sampling the 500-acre forest, we had decided that we wish to estimate the mean volume per acre to within \(\pm 100\) cubic feet per acre unless a 1-in-20 chance occurs in sampling. As we plan to sample with 0.2-acre plots, the error specification should be put on a 0.2-acre basis. Therefore, \[E=20\].
From previous sampling the stratum variances for 0.2-acre volumes are estimated to be:
\[s_1^2=8,000\]
\[s_2^2=10,000\]
\[s_3^2=5,000\]
The stratum sizes are known to be as previously shown:
\[N_1=1,350\] \[N_2=700\] \[N_3=450\] \[N=2,500\]
Therefore,
\[n={2,500[(1350)(8000)+(700)(10000)+(450)(5000)] \over \frac{(2500)^2(20)^2}{4}+[(1350)(8000)+(700)(10000)+(450)(5000)]}=77.7 \text { or } 78\]
The 78 sample units would now be allocated to the strata by the formula:
\[n_h=(\frac{N_h}{N})n\]
giving \[n_1=42\] \[n_2=22\] \[n_3=14\]
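The sample size calculation and the allocation that follows it can be retraced in code. This sketch assumes proportional allocation and the 0.05 probability level (the constant 4); for the 0.01 level the text's 6.76 would be substituted. Variable names are illustrative.

```python
import math

stratum_sizes = [1350, 700, 450]     # N_h, in 0.2-acre plots
stratum_vars = [8000, 10000, 5000]   # prior estimates of s_h^2
E = 100 * 0.2                        # 100 cu ft per acre on a 0.2-acre basis
k = 4                                # 4 for the 0.05 level, 6.76 for 0.01

N = sum(stratum_sizes)
weighted_var = sum(N_h * s2 for N_h, s2 in zip(stratum_sizes, stratum_vars))

# n = N * sum(N_h s_h^2) / (N^2 E^2 / k + sum(N_h s_h^2))
n_exact = N * weighted_var / (N ** 2 * E ** 2 / k + weighted_var)
n_total = math.ceil(n_exact)

# Proportional allocation of the total to the strata.
allocation = [round(n_total * N_h / N) for N_h in stratum_sizes]

print(f"E on a 0.2-acre basis = {E:.0f}")
print(f"computed n = {n_exact:.1f}, so take {n_total} units")
print(f"allocation by stratum = {allocation}")
```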