Chapter 4 Sampling Discrete Variables

4.1 Random Sampling

The sampling methods discussed in the previous sections apply to data that are on a continuous or nearly continuous scale of measurement. These methods may not be applicable if each unit observed is simply classified into one of two categories: alive or dead, germinated or not germinated, infected or not infected. Data of this type may follow what is known as the binomial distribution, and they require slightly different statistical techniques.

As an illustration, suppose that a sample of 1,000 seeds was selected at random and tested for germination. If 480 of the seeds germinated, the estimated viability for the lot would be:

\[\bar p=\frac{480}{1000}=0.48, \text { or } 48 \text { percent}\]

Confidence limits for the population viability are easily obtained from appendix table 5: look in the “fraction observed” column for 0.48, and then move across to the column for a sample of size 1,000. The figures in this column on the 95-percent side of the table are 45 and 51. Thus, unless a one-in-twenty chance has occurred in sampling, the germination percent for the population is between 45 and 51. The 99-percent confidence limits, obtained in the same manner, are 44 and 52.

If the sample size is $n=10,15,20,30, \text { or } 50$, it will be necessary to look in the far left column for the number actually observed (rather than the fraction observed). Then in the appropriate sample-size column will be found the confidence limits for the fraction observed. Thus, for a germination of 24 seeds in a sample of 50 (so $\bar p=0.48$) the 95-percent confidence limits would be 0.34 and 0.63.
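When the table is not at hand, comparable limits can be computed directly. The text does not say how table 5 was constructed; the sketch below assumes the exact binomial (Clopper-Pearson) construction, and the function names `binom_cdf` and `exact_limits` are illustrative only.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); terms are built in log space to avoid underflow."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log(1.0 - p))
        total += math.exp(log_term)
    return min(total, 1.0)

def exact_limits(x, n, confidence=0.95):
    """Two-sided limits for a proportion when x of n units show the attribute.

    Assumption: equal-tailed exact binomial (Clopper-Pearson) limits.
    """
    alpha = 1.0 - confidence

    def solve(f):
        # Bisection for the root of f, which increases from negative to positive on (0, 1).
        lo, hi = 0.0, 1.0
        for _ in range(50):
            mid = (lo + hi) / 2.0
            if f(mid) < 0.0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    # Lower limit: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else solve(lambda p: (1.0 - binom_cdf(x - 1, n, p)) - alpha / 2.0)
    # Upper limit: the p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else solve(lambda p: alpha / 2.0 - binom_cdf(x, n, p))
    return lower, upper

# 24 germinated seeds in a sample of 50, as in the example above
low, high = exact_limits(24, 50)
print(round(low, 2), round(high, 2))
```

If the assumption about the table holds, the printed limits should be close to the tabled values of 0.34 and 0.63 for this case.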

For large samples (say $n>250$) with proportions greater than 0.20 but less than 0.80, approximate confidence limits can be obtained another way. First we compute the standard error of $\bar p$ by the equation:

\[s_{\bar p}=\sqrt{\frac{\bar p(1-\bar p)}{n-1}\left(1-\frac{n}{N}\right)}\]

Then, the 95-percent confidence limits are given by:

\[\text{95-percent confidence interval}=\bar p \pm \left[2s_{\bar p}+\frac{1}{2n}\right]\]

Applying this to the above example we get:

\[s_{\bar p}=\sqrt{\frac{(0.48)(0.52)}{999}}=0.0158 \text{ (fpc ignored)}\]

And,

\[\text {95-percent confidence interval} =0.48 \pm [2(0.0158)+\frac {1}{2(1,000)}]=0.448 \text { to } 0.512\]

The 99-percent confidence limits are approximated by

\[\text {99-percent confidence interval}=\bar p \pm [2.6s_{\bar p} + \frac {1}{2n}]\]
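The large-sample formulas above can be sketched in a few lines of Python. The function name `proportion_ci` and its parameters are illustrative; `z` is the multiplier from the text (2 for 95 percent, 2.6 for 99 percent), and the finite population correction is applied only when a population size `N` is supplied.

```python
import math

def proportion_ci(p_bar, n, z=2.0, N=None):
    """Approximate confidence limits for a proportion from a simple random sample.

    z = 2.0 gives the 95-percent limits and z = 2.6 the 99-percent limits.
    """
    # Finite population correction; ignored when N is not given.
    fpc = 1.0 if N is None else 1.0 - n / N
    se = math.sqrt(p_bar * (1.0 - p_bar) / (n - 1) * fpc)
    # The 1/(2n) term is the continuity correction used in the text.
    half_width = z * se + 1.0 / (2 * n)
    return p_bar - half_width, p_bar + half_width

# Germination example: 480 of 1,000 seeds germinated
low95, high95 = proportion_ci(0.48, 1000)
low99, high99 = proportion_ci(0.48, 1000, z=2.6)
print(round(low95, 3), round(high95, 3))  # 0.448 0.512
print(round(low99, 3), round(high99, 3))  # 0.438 0.522
```

Both approximate intervals agree with the limits read from table 5 (45 to 51 percent and 44 to 52 percent) once rounded to whole percents.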

4.1.1 Sample size

Table 5 can also be used to estimate the number of units that would have to be observed in a simple random sample in order to estimate a population proportion with some specified precision.

Suppose, for example, that we wanted to estimate the germination percent for a population to within plus or minus 10 percent (or 0.10) at the 95-percent confidence level. The first step is to guess about what the proportion of seed germinating will be. If a good guess is not possible, then the safest course is to guess $\bar p=0.50$ as this will give the maximum sample size.

Next, pick any of the sample sizes given in the table (10,15,20,30,50,100,250, and 1,000) and look at the confidence interval for the specified value of $\bar p$. Inspection of these limits will tell whether or not the precision will be met with a sample of this size or if a larger or smaller sample would be more appropriate.

Thus, if we guess $\bar p=0.2$, then in a sample of n=50 we would expect to observe (0.2)(50)=10, and the table says that the 95-percent confidence limits on $\bar p$ would be 0.10 and 0.34. Since the upper limit is not within 0.10 of $\bar p$, a larger sample would be needed. For a sample of $n=100$ the limits are 0.13 to 0.29. Since both of these values are within 0.10 of $\bar p$, a sample of 100 would be adequate.

If the table indicates the need for a sample of over 250, the size can be approximated by:

\[n=\frac {4(\bar p)(1-\bar p)}{E^2}\text {, for 95-percent confidence}\]

or,

\[n=\frac {20(\bar p)(1-\bar p)}{3E^2}\text {, for 99-percent confidence}\]

where $E$ is the precision with which $\bar p$ is to be estimated (expressed in the same form as $\bar p$, either percent or decimal).
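As a quick check on the two sample-size formulas, the sketch below evaluates them in Python. The function name `sample_size` is illustrative, the input values are made up for the example, and rounding up to the next whole unit is an assumption rather than something the text specifies.

```python
import math

def sample_size(p_guess, E, confidence=95):
    """Approximate number of units for estimating a proportion to within +/- E.

    p_guess and E must be in the same form (both decimals here).
    """
    if confidence == 95:
        factor = 4.0
    elif confidence == 99:
        factor = 20.0 / 3.0
    else:
        raise ValueError("only the 95- and 99-percent formulas are given")
    n = factor * p_guess * (1.0 - p_guess) / E ** 2
    # Round off floating-point noise, then round up (an assumption).
    return math.ceil(round(n, 6))

# Illustrative values: no good guess available, so p = 0.50; precision of +/- 0.05
print(sample_size(0.50, 0.05))                 # 400
print(sample_size(0.50, 0.05, confidence=99))  # 667
```

Because $\bar p(1-\bar p)$ is largest at $\bar p=0.50$, any other guess for the proportion gives a smaller required sample at the same precision.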

4.2 Cluster Sampling for Attributes

Simple random sampling of discrete variables is often difficult or impractical. In estimating plantation survival, for example, we could select individual trees at random and examine them, but it wouldn’t make much sense to walk down a row of planted trees in order to observe a single member of that row. It would usually be more reasonable to select rows at random and observe all of the trees in the selected rows.

Seed viability is often estimated by randomly selecting several lots of 100 or 200 seeds each and recording for each lot the percentage of the seeds that germinate.

These are examples of cluster sampling; the unit of observation is the cluster rather than the individual tree or single seed. The value attached to the unit is the proportion having a certain characteristic rather than the simple fact of having or not having that characteristic.

If the clusters are large enough (say over 100 individuals per cluster) and nearly equal in size, the statistical methods that have been described for measurement variables can often be applied. Thus, suppose that the germination percent for a seedlot is estimated by selecting $n=10$ sets of 200 seeds each and observing the germination percent for each set. If the results were:

Set	1	2	3	4	5	6	7	8	9	10	Sum
Germination percent ($p$)	78.5	82.0	86.0	80.5	74.5	78.0	79.0	81.0	80.5	83.5	803.5

Then the germination percent is estimated by:

\[\bar p=\frac {\Sigma p}{n}=\frac {803.5}{10}=80.35 \text { percent}\]

The standard deviation of $p$ is:

\[s_p=\sqrt{\frac{\Sigma p^2-\frac{(\Sigma p)^2}{n}}{n-1}}=\sqrt{\frac{78.5^2+\cdots+83.5^2-\frac{(803.5)^2}{10}}{9}}=\sqrt{10.002778}=3.163\]

And the standard error for $\bar p$ is:

\[s_{\bar p}=\sqrt{\frac{s^2_p}{n}\left(1-\frac{n}{N}\right)}=\sqrt{\frac{10.002778}{10}}=1.000 \text{ (fpc ignored)}\]

Note that $n$ and $N$ in these equations refer to the number of clusters, not to the number of individuals.

The 95-percent confidence interval, computed by the procedure for continuous variables, is:

\[\bar p \pm (t_{0.05})(s_{\bar p})=80.35 \pm 2.262(1.000)=78.1 \text{ to } 82.6 \quad (t \text{ has } (n-1)=9 \text{ df})\]
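The cluster computation can be reproduced with Python's standard library. This is a minimal sketch: the variable names are illustrative, the $t$ value of 2.262 is taken from the text rather than computed, and the fpc is ignored as in the worked example.

```python
import math
import statistics

# Germination percent for each of the n = 10 sets of 200 seeds
germination = [78.5, 82.0, 86.0, 80.5, 74.5, 78.0, 79.0, 81.0, 80.5, 83.5]

n = len(germination)                  # number of clusters, not number of seeds
p_bar = statistics.mean(germination)  # estimated germination percent
s_p = statistics.stdev(germination)   # sample standard deviation (n - 1 divisor)
se = s_p / math.sqrt(n)               # standard error of the mean, fpc ignored

t_05 = 2.262                          # t for 9 df at the 0.05 level, from the text
low, high = p_bar - t_05 * se, p_bar + t_05 * se

print(round(p_bar, 2), round(s_p, 3), round(se, 3))  # 80.35 3.163 1.0
print(round(low, 1), round(high, 1))                 # 78.1 82.6
```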

4.2.1 Transformations

The above method of computing confidence limits assumes that the individual percentages follow something close to a normal distribution with homogeneous variance (i.e., the same variance regardless of the size of the percent). If the clusters are small (say less than 100 individuals per cluster) or some of the percentages are greater than 80 or less than 20, the assumptions may not be valid and the computed confidence limits will be unreliable.

In such cases it may be desirable to compute the transformation:

\[y=\text{arc sine }\sqrt{\text{percent}}\]

and to analyze the transformed variable. The transformation is easily made by means of table 6. Thus, in the previous example, we would have:

percent	78.5	82.0	86.0	80.5	74.5	78.0	79.0	81.0	80.5	83.5	Sum
arc sine $\sqrt{percent}$	62.4	64.9	68.0	63.8	59.7	62.0	62.7	64.2	63.8	66.0	637.5

Then working with the transformed variables,

\[\bar y=\frac {637.5}{10}=63.75 \text{, corresponding to a mean percentage of 80.4}\]

The variance of $y$ is:

\[s^2_y=\frac{62.4^2+\cdots+66.0^2-\frac{637.5^2}{10}}{9}=5.227222\]

And the standard error of $\bar y$ is:

\[s_{\bar y}=\sqrt{\frac{5.227222}{10}}=0.723\]

The 95-percent confidence interval on mean $y$ is:

\[\bar y \pm (t_{0.05})(s_{\bar y})=63.75 \pm (2.262)(0.723)=62.11 \text{ to } 65.39\]

These limits correspond to percentages of 78.1 to 82.7.

Because the clusters are fairly large (200 seeds each) and the percentages are grouped near 80 rather than at the extremes, the transformation did not have much effect in this case.
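The transformed analysis can also be carried out without table 6. The sketch below assumes the tabled values are angles in degrees, computed from the percent expressed as a proportion and rounded to one decimal place; the names `to_angle` and `to_percent` are illustrative, and the $t$ value is again taken from the text.

```python
import math
import statistics

percent = [78.5, 82.0, 86.0, 80.5, 74.5, 78.0, 79.0, 81.0, 80.5, 83.5]

def to_angle(pct):
    """Arc sine of the square root of the proportion, in degrees."""
    return math.degrees(math.asin(math.sqrt(pct / 100.0)))

def to_percent(angle):
    """Back-transform an angle in degrees to a percent."""
    return 100.0 * math.sin(math.radians(angle)) ** 2

# Round to one decimal to mimic reading the values from a table (an assumption)
y = [round(to_angle(p), 1) for p in percent]

y_bar = statistics.mean(y)
se_y = math.sqrt(statistics.variance(y) / len(y))  # fpc ignored

t_05 = 2.262  # t for 9 df at the 0.05 level, from the text
low_y, high_y = y_bar - t_05 * se_y, y_bar + t_05 * se_y

# Limits are computed on the transformed scale, then converted back to percents
print(round(y_bar, 2), round(se_y, 3))
print(to_percent(y_bar), to_percent(low_y), to_percent(high_y))
```

The back-transformed mean and limits should come out close to the 80.4 percent and the 78.1 to 82.7 percent reported above.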