Chapter 2 Some Basic Terms and Calculations
2.1 The Mean
One of the most familiar and commonly estimated population parameters is the mean. Given a simple random sample, the population mean is estimated by: \[\overline{X} = \frac{\sum_{i=1}^n X_{i}}{n}\]
where:
\(X_{i}\) = The observed value of the \(i^{th}\) unit in the sample.
\(n\) = The number of units in the sample.
\(\sum_{i=1}^n X_{i}\) means to sum up all \(n\) of the X-values in the sample.
If there are \(N\) units in the population, the total of the X-values over all units in the population would be estimated by: \[\hat{T} = N \overline X\]
The circumflex (^) over the \(T\) is frequently used to indicate an estimated value as opposed to the true but unknown population value.
It should be noted that this estimate of the mean is used for a simple random sample. It may not be appropriate if the units included in the sample are not selected entirely at random.
Methods of computing confidence limits for the mean are discussed in the section on sampling.
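As a minimal sketch of these two estimators, the fragment below (Python, illustrative only) uses the ten loblolly pine diameters from the example in section 2.2 and assumes a population of \(N = 1{,}000\) trees, the value used again in section 2.4:

```python
diameters = [9, 9, 11, 9, 7, 7, 10, 8, 9, 11]   # X_i, inches
n = len(diameters)                               # sample size, n = 10

x_bar = sum(diameters) / n   # sample mean, X-bar = (sum of X_i) / n
N = 1000                     # assumed number of units in the population
t_hat = N * x_bar            # estimated population total, T-hat = N * X-bar

print(x_bar)   # 9.0
print(t_hat)   # 9000.0
```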
2.2 Standard Deviation
Another commonly estimated population parameter is the standard deviation. The standard deviation characterizes dispersion of individuals about the mean. It gives us some idea whether most of the individuals in a population are close to the mean or spread out. The standard deviation of individuals in a population is frequently symbolized by \(\sigma\) (sigma). On the average, about two-thirds of the unit values of a normal population will be within 1 standard deviation of the mean. About 95 percent will be within 2 standard deviations and about 99 percent within 2.6 standard deviations.
We will seldom know or be able to determine \(\sigma\) exactly. However, given a sample of individual values from the population we can often make an estimate of \(\sigma\), which is commonly symbolized by \(s\). For a simple random sample of \(n\) units, the estimate is: \[s=\sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n-1}}\]
where:
\(\sum X^2 = \text {the sum of squared values of all individual measurements}\)
\((\sum X)^2 =\text {the square of the sum of all measurements}\)
This is equivalent to the formula \(s=\sqrt{\frac{\sum (X_{i} - \bar{X})^{2}}{n - 1}}\)
where:
\(\bar{X}=\text {the arithmetic mean }=\frac{\sum X}{n}\)
\((X_{i} - \bar{X})=\text {deviation of an individual measurement from the mean of all measurements}\).
Here is an example: Ten individual trees in a loblolly pine plantation were selected at random and measured. Their diameters were \(9, 9, 11, 9, 7, 7, 10, 8, 9, \text { and }11\text { inches}\). Based on this sample, what is the arithmetic mean diameter and the standard deviation? Tabulating the measurements and squaring each of them:
\(X\) | \(X^2\) |
---|---|
9 | 81 |
9 | 81 |
11 | 121 |
9 | 81 |
7 | 49 |
7 | 49 |
10 | 100 |
8 | 64 |
9 | 81 |
11 | 121 |
\(\sum X = 90 \text { and }\sum X^2 = 828 \text { for the table above}\).
\(\text {Mean} = \bar{X}=\frac{\sum X}{n}=\frac{90}{10}=9.0\)
\(\text {Standard deviation} = s=\sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n-1}}=\sqrt{\frac{828 - \frac{90^2}{10}}{10-1}}=\sqrt{\frac{18}{9}}=1.414\)
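The arithmetic above is easy to verify; the sketch below computes \(s\) with both the computational form and the equivalent definitional form:

```python
import math

x = [9, 9, 11, 9, 7, 7, 10, 8, 9, 11]   # the ten diameters, inches
n = len(x)

sum_x = sum(x)                   # 90
sum_x2 = sum(v * v for v in x)   # 828

# computational form: s = sqrt((sum X^2 - (sum X)^2 / n) / (n - 1))
s = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

# definitional form: s = sqrt(sum (X_i - X-bar)^2 / (n - 1))
x_bar = sum_x / n
s_alt = math.sqrt(sum((v - x_bar) ** 2 for v in x) / (n - 1))

print(round(s, 3), round(s_alt, 3))   # 1.414 1.414
```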
Statisticians often speak in terms of variance rather than standard deviation. The variance is simply the square of the standard deviation. The population variance is symbolized by \(\sigma^2\) and the sample estimate of the variance by \(s^2\).
Using the sample range to estimate the standard deviation. – The standard deviation of the sample is an estimate of the standard deviation \((\sigma)\) of the population. The sample range \((R)\) may also be used to estimate the population standard deviation. Table 1 (Appendix) shows the ratio of the population standard deviation to the range for simple random samples of various sizes. In the example we’ve been using, the range is \(11-7=4\). For a sample of size 10, the table gives the value of the ratio \(\frac{\sigma}{R}\) as 0.325. Therefore, \(\frac{\sigma}{4}= 0.325\) and \(\sigma = 1.3\) is an estimate of the true population standard deviation. Though easy to compute, this is an efficient estimator of \(\sigma\) only for very small samples (say fewer than 7 observations).
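In code, the range-based estimate is a one-line calculation; the ratio 0.325 below is the Table 1 value quoted above for a sample of 10:

```python
x = [9, 9, 11, 9, 7, 7, 10, 8, 9, 11]

R = max(x) - min(x)     # sample range, 11 - 7 = 4
ratio = 0.325           # sigma/R for n = 10, from Table 1 (Appendix)
sigma_hat = ratio * R   # estimated population standard deviation

print(sigma_hat)        # 1.3
```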
2.3 Coefficient of Variation
In nature, populations with large means often show more variation than populations with small means. The coefficient of variation \((C)\) facilitates comparison of variability about different sized means. It is the ratio of the standard deviation to the mean. A standard deviation of 2 for a mean of 10 indicates the same relative variability as a standard deviation of 16 for a mean of 80. The coefficient of variation would be 0.20 or 20 percent in each case.
In the problem discussed in the previous section the coefficient of variation would be estimated by: \[C=\frac{s}{\bar{X}}=\frac{1.414}{9.0}=0.157 \text { or 15.7 percent}\]
2.4 Standard Error of the Mean
There is usually variation among the individual units of a population. The standard deviation is a measure of this variation.
Since the individual units vary, variation may also exist among the means (or any other estimates) computed from samples of these units. Take, for example, a population with a true mean of 10. If we were to select four units at random, they might have a sample mean of 8. Another sample of four units from the same population might have a mean of 11, another 10.5, and so forth. Clearly it would be desirable to know the variation likely to be encountered among the means of samples from this population. The measure of this variation is the standard error of the mean. It can be thought of as a standard deviation among sample means, just as the standard deviation is a measure of the variation among individuals. As will be described in the section on simple random sampling, the standard error of the mean may be used to compute confidence limits for a population mean.
The computation of the standard error of the mean (often symbolized by \(s_\bar{x}\)) depends on the manner in which the sample was selected. For simple random sampling without replacement (i.e., a given unit cannot appear in the sample more than once) from a population having a total of \(N\) units the formula for the estimated standard error of the mean is: \[s_\bar{x}=\sqrt{\frac{s^2}{n}(1-\frac{n}{N})}\]
In the problem discussed in section 2.2 we had \(n=10\) and found that \(s=1.414\) or \(s^2=2\). If the population contained 1,000 trees, the estimated mean diameter \((\bar{X}=9.0 \text { inches})\) would have a standard error of: \[s_\bar{x}=\sqrt{\frac{2}{10}(1-\frac{10}{1000})}=\sqrt{0.198}=0.445\]
The term \((1-\frac{n}{N})\) is called the finite population correction or fpc. If sampling is with replacement (not too common) or if the sampling fraction \((\frac{n}{N})\) is very small (say less than 1/20), the fpc may be omitted and the standard error of the mean for a simple random sample is simply: \[s_\bar{x}=\sqrt\frac{s^2}{n}\]
The variance of the sample mean is simply the square of the standard error of the mean. \[s^2_\bar{x}=\frac{s^2}{n}(1-\frac{n}{N})\]
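Continuing the running example, the sketch below computes the coefficient of variation from section 2.3 and the standard error of the mean with and without the fpc, assuming as before a population of 1,000 trees:

```python
import math

x_bar, s, n, N = 9.0, 1.414, 10, 1000
s2 = 2.0                          # sample variance, s^2

C = s / x_bar                     # coefficient of variation
fpc = 1 - n / N                   # finite population correction
se = math.sqrt((s2 / n) * fpc)    # standard error of the mean, with fpc
se_no_fpc = math.sqrt(s2 / n)     # fpc omitted (sampling fraction small)

print(round(C, 3))           # 0.157
print(round(se, 3))          # 0.445
print(round(se_no_fpc, 3))   # 0.447
```

Note how little the fpc matters here: with a sampling fraction of only 10/1,000, dropping it changes the standard error from 0.445 to 0.447.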
2.5 Covariance
Very often, each unit of a population will have more than a single characteristic. Trees, for example, may be characterized by their height, diameter, and form class. The covariance is a measure of the association between the magnitudes of two characteristics. If there is little or no association, the covariance will be close to zero. If the large values of one characteristic tend to be associated with the small values of another characteristic, the covariance will be negative. If the large values of one characteristic tend to be associated with the large values of another characteristic, the covariance will be positive. The population covariance of \(X\) and \(Y\) is often symbolized \(\sigma_{xy}\); the sample estimate by \(s_{xy}\).
Suppose that the diameter (inches) and age (years) have been obtained for a number of randomly selected trees. If we symbolize diameter by \(Y\) and age by \(X\), the sample covariance of diameter and age is given by: \[s_{xy}=\frac{\sum XY-\frac{(\sum X)(\sum Y)}{n}}{n-1}\]
This is equivalent to the formula: \[s_{xy}=\frac{\sum (X-\bar{X})(Y-\bar{Y})}{(n-1)}\]
If \(n=12\) and the \(Y\) and \(X\) values were as follows:
X | Y |
---|---|
20 | 4 |
40 | 9 |
30 | 7 |
45 | 7 |
25 | 5 |
45 | 10 |
30 | 9 |
40 | 6 |
20 | 8 |
35 | 6 |
25 | 4 |
40 | 11 |
- \(\text {Sum }X = 395\)
- \(\text {Sum }Y = 86\)
then \[s_{xy}=\frac{(4)(20)+(9)(40)+...+(11)(40)-\frac{(86)(395)}{12}}{12-1}=\frac{2960-2830.83}{11}=11.74\]
The positive covariance is consistent with the well-known and economically unfortunate fact that the larger diameters tend to be associated with the older ages.
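The computation can be checked with a few lines of code; the sketch below reproduces the sample covariance from the corrected sum of products:

```python
ages = [20, 40, 30, 45, 25, 45, 30, 40, 20, 35, 25, 40]   # X, years
diams = [4, 9, 7, 7, 5, 10, 9, 6, 8, 6, 4, 11]            # Y, inches
n = len(ages)

sum_xy = sum(x * y for x, y in zip(ages, diams))   # 2960
sum_x, sum_y = sum(ages), sum(diams)               # 395 and 86

s_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)      # sample covariance
print(round(s_xy, 2))                              # 11.74
```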
2.6 Simple Correlation Coefficient
The magnitude of the covariance, like that of the standard deviation, is often related to the size of the variables themselves. Units with large \(X\) and \(Y\) values tend to have larger covariances than units with small \(X\) and \(Y\) values. The covariance also depends on the scale of measurement: had the diameters been recorded in millimeters rather than inches, the covariance would have been 298.196 instead of 11.74.
The simple correlation coefficient, a measure of the degree of linear association between two variables, is free of the effects of scale of measurement. It can vary from -1 to +1. A correlation of 0 indicates that there is no linear association (there may be a very strong nonlinear association, however). A correlation of +1 or -1 would suggest a perfect linear association. As with the covariance, a positive correlation implies that the large values of \(X\) are associated with the large values of \(Y\). If the large values of \(X\) are associated with the small values of \(Y\), the correlation is negative.
The population correlation coefficient is commonly symbolized by \(\rho\) (rho), and the sample-based estimate by \(r\). The population correlation coefficient is defined to be: \[\rho=\frac{\text {Covariance of X and Y}}{\sqrt{(\text {Variance of X})(\text {Variance of Y})}}\]
For a simple random sample, the sample correlation coefficient is computed as follows: \[r=\frac{s_{xy}}{s_x s_y}=\frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}}\]
where:
\(s_{xy} = \text {Sample covariance of X and Y}\)
\(s_x =\text {Sample standard deviation of X}\)
\(s_y =\text {Sample standard deviation of Y}\)
\(\sum xy =\text {Corrected sum of XY products}\)
\[=\sum XY-\frac{(\sum X)(\sum Y)}{n}\]
\(\sum x^2 = \text {Corrected sum of squares for X}\)
\[=\sum X^2-\frac{(\sum X)^2}{n}\]
\(\sum y^2 = \text {Corrected sum of squares for Y}\)
\[=\sum Y^2-\frac{(\sum Y)^2}{n}\]
For the values used to illustrate the covariance we have:
\[\sum xy=(4)(20)+(9)(40)+...+(11)(40)-\frac{(86)(395)}{12}=129.1667\] \[\sum y^2=4^2+9^2+...+11^2-\frac{86^2}{12}=57.667\] \[\sum x^2=20^2+40^2+...+40^2-\frac{395^2}{12}=922.9167\] So, \[r=\frac{129.1667}{\sqrt{(57.667)(922.9167)}}=\frac{129.1667}{230.6980}=0.56\]
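The same figures fall out of a direct translation of the corrected sums into code:

```python
import math

ages = [20, 40, 30, 45, 25, 45, 30, 40, 20, 35, 25, 40]   # X
diams = [4, 9, 7, 7, 5, 10, 9, 6, 8, 6, 4, 11]            # Y
n = len(ages)

sxy = sum(x * y for x, y in zip(ages, diams)) - sum(ages) * sum(diams) / n
sxx = sum(x * x for x in ages) - sum(ages) ** 2 / n
syy = sum(y * y for y in diams) - sum(diams) ** 2 / n

r = sxy / math.sqrt(sxx * syy)   # sample correlation coefficient
print(round(r, 2))               # 0.56
```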
Correlation or chance.–The computed value of a statistic such as the correlation coefficient depends on which particular units were selected for the sample. Such estimates will vary from sample to sample. More important, they will usually vary from the population value that we are trying to estimate.
In the above example, the sample correlation coefficient was 0.56. Does this mean that there is a real linear association between \(Y\) and \(X\)? Or could we get a value as large as this just by chance when sampling a population in which there is no linear association between \(Y\) and \(X\) (i.e., a population for which \(\rho=0\))?
This can be tested by referring to table 7 (Appendix). The column headed “Degrees of freedom” is related to the sample size: a correlation coefficient estimated from a simple random sample of \(n\) units will have \((n-2)\) degrees of freedom. Looking in the row for 10 degrees of freedom, we find in the column headed “5%” a value of 0.576. This says that in sampling from a population for which \(\rho=0\) we would get a sample value as large as 0.576 just by chance about 5 percent of the time. Sample values smaller than 0.576 would occur more often than this. Thus we might conclude that our sample \(r=0.56\) could have been obtained by chance in sampling from a population with a true correlation of zero.
This test result is usually summarized by saying that the sample correlation coefficient is not significant at the 0.05 level. In statistical terms, we tested the hypothesis that \(\rho=0\) and failed to reject that hypothesis at the 0.05 level. This is not exactly the same as saying that we accept the hypothesis or that we have proved that \(\rho=0\). The distinction is subtle but real.
For a sample correlation larger than 0.576 we might decide that the departure from a value of zero is larger than we would expect by chance. Statistically we would reject the hypothesis that \(\rho=0\).
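Where the appendix table is not at hand, the critical value can be recovered from the \(t\) distribution, since \(r\sqrt{n-2}/\sqrt{1-r^2}\) follows a \(t\) distribution with \(n-2\) degrees of freedom when \(\rho=0\). The sketch below assumes scipy is available and reproduces the tabled value of 0.576:

```python
import math
from scipy import stats

r, n = 0.56, 12
df = n - 2                                      # degrees of freedom

t_crit = stats.t.ppf(0.975, df)                 # two-tailed 5% point, 2.228
r_crit = t_crit / math.sqrt(df + t_crit ** 2)   # critical r

print(round(r_crit, 3))   # 0.576
print(abs(r) >= r_crit)   # False -> not significant at the 0.05 level
```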
2.7 Variance of a Linear Function
Quite often we will want to combine variables or population estimates in a linear function. For example, if the mean timber volume per acre has been estimated as \(\bar X\), then the total volume on \(M\) acres will be \(M\bar X\); the estimate of total volume is a linear function of the estimated mean volume. If the estimate of cubic-foot volume per acre in sawtimber is \(\bar X_1\) and of pulpwood above the sawtimber top is \(\bar X_2\), then the estimate of total cubic-foot volume per acre is \(\bar X_1+\bar X_2\). If on a given tract the mean volume per half-acre is \(\bar X_1\) for spruce and the mean volume per quarter-acre is \(\bar X_2\) for yellow birch, then the estimated total volume per acre of spruce and birch would be \(2\bar X_1+4\bar X_2\).
In general terms, a linear function of three variables (say \(X_1, X_2, \text { and } X_3\)) can be written as \(L=a_1X_1+a_2X_2+a_3X_3\)
where:
- \(a_1\), \(a_2\), and \(a_3\) are constants.
If the variances are \(s_1^2, s_2^2,\text { and }s_3^2\) (for \(X_1, X_2, \text { and }X_3\) respectively) and the covariances are \(s_{1,2}, s_{1,3},\text { and }s_{2,3}\), then the variance of \(L\) is given by:
\[s_L^2=a_1^2s_1^2+a_2^2s_2^2+a_3^2s_3^2+2(a_1a_2s_{1,2}+a_1a_3s_{1,3}+a_2a_3s_{2,3})\]
The standard deviation (or standard error) of \(L\) is simply the square root of this.
The extension of the rule to cover any number of variables should be fairly obvious.
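In matrix terms the rule is the quadratic form \(s_L^2 = a' S a\), where \(S\) is the variance-covariance matrix of the variables. A small helper along these lines (an illustrative sketch using numpy, not a formula from the text) expands to exactly the expression above:

```python
import numpy as np

def var_linear(a, S):
    """Variance of L = a1*X1 + ... + ak*Xk as the quadratic form a' S a."""
    a = np.asarray(a, dtype=float)
    S = np.asarray(S, dtype=float)   # S[i, j] = covariance of X_i and X_j
    return float(a @ S @ a)
```

The examples that follow can be checked against it.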
Some examples
I. The sample mean volume per acre for a 10,000-acre tract is \(\bar X=5,680 \text { board feet}\) with a standard error of \(s_\bar X=632\) (so \(s_\bar X^2 =399,424\)). The estimated total volume is: \[L=10,000(\bar X)=56,800,000 \text { board feet}\]
The variance of this estimate would be: \[s_L^2=(10,000)^2(s_\bar x^2)=39,942,400,000,000\]
Since the standard error of an estimate is the square root of its variance, the standard error of the estimated total is: \[s_L={\sqrt {s_L^2}}=6,320,000\]
II. In 1955 a random sample of 40 one-quarter-acre circular plots was used to estimate the cubic-foot volume of a stand of pine. Plot centers were monumented for possible relocation at a later time. The mean volume per plot was \(\bar X_1=225 \text { cubic feet}\). The plot variance was \(s^2_{x_1}=8,281\), so the variance of the mean was \(s^2_{\bar x_1}=8,281/40=207.025\).
In 1960 a second inventory was made using the same plot centers. This time, however, the circular plots were only one-tenth acre. The mean volume per plot was \(\bar X_2=122 \text { cubic feet}\). The plot variance was \(s^2_{x_2}=6,084\), so the variance of the mean was \(s^2_{\bar x_2}=6,084/40=152.100\). The covariance of initial and final plot volumes was \(s_{x_1 x_2}=4,259\), making the covariance of the means \(s_{\bar x_1 \bar x_2}=4,259/40=106.475\).
The net periodic growth per acre would be estimated as: \(G=10\bar X_2-4\bar X_1=10(122)-4(225)=320 \text { cubic feet per acre}\).
By the rule for linear functions, the variance of \(G\) would be: \[s_G^2=(10)^2s^2_{\bar x_2}+(-4)^2s^2_{\bar x_1}+2(10)(-4)s_{\bar x_1 \bar x_2}\]
\[=100(152.100)+16(207.025)-80(106.475)\] \[=10,004.4\]
In this example there was a statistical relationship between the 1960 and 1955 means because the same plot locations were used in both samples. The covariance of the means \((s_{\bar x_1 \bar x_2})\) is a measure of this relationship. If the plots in 1960 had been located at random rather than at the 1955 locations, the two means would have been considered statistically independent and their covariance would have been set at zero. In this case the equation for the variance of the net periodic growth per acre \((G)\) would reduce to: \[s^2_G=(10)^2s^2_{\bar x_2}+(-4)^2s^2_{\bar x_1}\] \[=100(152.100)+16(207.025)=18,522.4\]
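Both variants of Example II can be checked against the quadratic-form sketch given earlier: the diagonal entries of \(S\) are the variances of the two means, and the off-diagonal entry is their covariance (zeroed for the independent-sample case).

```python
import numpy as np

a = np.array([10.0, -4.0])           # coefficients of x2-bar and x1-bar
S = np.array([[152.100, 106.475],    # var(x2-bar)   cov(x2-bar, x1-bar)
              [106.475, 207.025]])   # cov           var(x1-bar)

print(round(float(a @ S @ a), 1))    # 10004.4  (remeasured plot centers)

S[0, 1] = S[1, 0] = 0.0              # independent samples: covariance = 0
print(round(float(a @ S @ a), 1))    # 18522.4
```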