Chapter 5 Chi-Square Tests

5.1 Test of Independence

Individuals are often classified according to two (or more) distinct systems. A tree can be classified as to species and at the same time according to whether or not it is infected with some disease. A milacre plot can be classified as to whether or not it is stocked with adequate reproduction and whether it is shaded or not shaded. Given such a cross-classification, it may be desirable to know whether the classification of an individual according to one system is independent of its classification by the other system. In the species-infection classification, for example, independence of species and infection would be interpreted to mean that there is no difference in infection rate between species (i.e., infection rate does not depend on species).

The hypothesis that two or more systems of classification are independent can be tested by chi-square. The procedure can be illustrated by a test of three termite repellents. A batch of 1,500 wooden stakes was divided at random into three groups of 500 each, and each group received a different termite-repellent treatment. The treated stakes were driven into the ground, with the treatment at any particular stake location being selected at random. Two years later the stakes were examined for termites. The number of stakes in each classification is shown in the following 2 by 3 (two rows and three columns) contingency table:

Classification Group I Group II Group III Subtotals
Attacked by termites 193 148 210 551
Not attacked 307 352 290 949
Subtotals 500 500 500 1500

If the data in the table are symbolized as shown below:

Classification I II III Subtotals
Attacked \(a_1\) \(a_2\) \(a_3\) \(A\)
Not attacked \(b_1\) \(b_2\) \(b_3\) \(B\)
Subtotals \(T_1\) \(T_2\) \(T_3\) \(G\)

The test of independence is made by computing:

\[\chi^2=\frac{1}{(A)(B)}\sum_{i=1}^{3}\frac{(a_iB-b_iA)^2}{T_i}\]

\[=\frac {1}{(551)(949)}[\frac {((193)(949)-(307)(551))^2}{500}+...+\frac {((210)(949)-(290)(551))^2}{500}]=17.66\]

This result is compared to the tabular value of \(\chi^2\) (table 4) with \((c-1)\) degrees of freedom, where \(c\) is the number of columns in the table of data. If the computed value exceeds the tabular value given in the 0.05 column, the difference among treatments is said to be significant at the 0.05 level (i.e., we reject the hypothesis that attack classification is independent of treatment classification).

In this example, the computed value of 17.66 (2 degrees of freedom) exceeds the tabular value in the 0.01 column, and so the difference in rate of attack among treatments is significant at the 1-percent level. Examination of the data suggests that this is primarily due to the lower rate of attack on the Group II stakes.
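As a check on the arithmetic, the 2 by 3 computation can be sketched in Python (standard library only; the counts are those in the contingency table above):

```python
# Chi-square test of independence for a 2 x c table, using the
# shortcut formula chi^2 = (1/AB) * sum_i (a_i*B - b_i*A)^2 / T_i.
a = [193, 148, 210]   # stakes attacked by termites, Groups I-III
b = [307, 352, 290]   # stakes not attacked
A = sum(a)            # subtotal attacked, 551
B = sum(b)            # subtotal not attacked, 949
T = [ai + bi for ai, bi in zip(a, b)]   # column subtotals, 500 each

chi_sq = sum((ai * B - bi * A) ** 2 / Ti
             for ai, bi, Ti in zip(a, b, T)) / (A * B)
print(round(chi_sq, 2))  # 17.66, with c - 1 = 2 degrees of freedom
```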

The r by c contingency table. The above example is a simple case of the chi-square test of independence in an \(r\) by \(c\) table (i.e., \(r\) rows and \(c\) columns). Thus, if a number of randomly selected forest stands were classified as to soil group and forest type the results might be as follows:

Soil group Forest Type I Forest Type II Forest Type III Subtotal
1 27 48 62 137
2 32 46 67 145
3 26 51 61 138
Subtotal 85 145 190 420

If the \(r\) by \(c\) table is represented in symbols:

Soil group Forest Type I Forest Type II Forest Type III Subtotal
1 \(a_{11}\) \(a_{12}\) \(a_{13}\) \(S_1\)
2 \(a_{21}\) \(a_{22}\) \(a_{23}\) \(S_2\)
3 \(a_{31}\) \(a_{32}\) \(a_{33}\) \(S_3\)
Subtotal \(T_1\) \(T_2\) \(T_3\) G

then the test of independence is: \[\chi^2=\frac{1}{G}\sum_{i,j}\frac{(Ga_{ij}-S_iT_j)^2}{S_iT_j},\text{ with }(r-1)(c-1)\text{ degrees of freedom}\]

In this example: \[\chi^2_{4 df}=\frac {1}{420}[\frac {((420)(27)-(137)(85))^2}{(85)(137)}+...+\frac {((420)(61)-(138)(190))^2}{(138)(190)}]=1.031\]
which is not significant at the 0.05 level. Thus, the test has failed to demonstrate any real association between forest types and soil groups.
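The general \(r\) by \(c\) computation is easy to script. A minimal Python sketch (standard library only), using the stand counts from the soil-group table above:

```python
# Chi-square test of independence for an r x c table:
# chi^2 = (1/G) * sum over cells of (G*a_ij - S_i*T_j)^2 / (S_i*T_j),
# with (r-1)(c-1) degrees of freedom.
table = [[27, 48, 62],
         [32, 46, 67],
         [26, 51, 61]]
S = [sum(row) for row in table]        # row subtotals S_i
T = [sum(col) for col in zip(*table)]  # column subtotals T_j
G = sum(S)                             # grand total, 420

chi_sq = sum((G * table[i][j] - S[i] * T[j]) ** 2 / (S[i] * T[j])
             for i in range(len(S)) for j in range(len(T))) / G
print(round(chi_sq, 3))  # 1.031, with 4 degrees of freedom
```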

The test of independence can be extended to more than two classification systems, but formulating meaningful hypotheses may be difficult.

5.2 Test of a Hypothesized Count

A geneticist hypothesized that, if a certain cross were made, the progeny would be of four types, in the proportions:

\[A=0.48,\quad B=0.32,\quad C=0.12,\quad D=0.08\]

The actual segregation of 1,225 progeny is shown below, along with the numbers expected according to the hypothesis.

Type A B C D Total
Number \((X_i)\) 542 401 164 118 1225
Expected \((m_i)\) 588 392 147 98 1225

Since the observed counts differ from those expected, we might wonder whether the hypothesis is false, or whether departures as large as this could occur strictly by chance.

The chi-square test is:

\[\chi^2=\sum_{i=1}^{k}\frac{(X_i-m_i)^2}{m_i},\text{ with }(k-1)\text{ degrees of freedom}\]

where:

\(k\)=The number of groups recognized
\(X_i\)=The observed count for the \(i^{th}\) group
\(m_i\)=The count expected in the \(i^{th}\) group if the hypothesis is true.

For the above data, \[\chi^2_{3df}=\frac {(542-588)^2}{588}+\frac {(401-392)^2}{392}+\frac {(164-147)^2}{147}+\frac {(118-98)^2}{98}=9.85\]

This value exceeds the tabular \(\chi^2\) with 3 degrees of freedom at the 0.05 level. Hence the hypothesis would be rejected (if the geneticist believed in testing at the 0.05 level).
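The same computation in Python (standard library only), with the hypothesized proportions used to generate the expected counts:

```python
# Chi-square test of a hypothesized count:
# chi^2 = sum_i (X_i - m_i)^2 / m_i, with k - 1 degrees of freedom.
observed = [542, 401, 164, 118]          # X_i, the observed segregation
proportions = [0.48, 0.32, 0.12, 0.08]   # hypothesized ratios A, B, C, D
total = sum(observed)                    # 1,225 progeny
expected = [p * total for p in proportions]   # m_i: 588, 392, 147, 98

chi_sq = sum((x - m) ** 2 / m for x, m in zip(observed, expected))
print(round(chi_sq, 2))  # 9.85, with 3 degrees of freedom
```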

5.3 Bartlett’s Test of Homogeneity of Variance

Many of the statistical methods described later are valid only if the variance is homogeneous. The t test of the following section assumes that the variance is the same for each group, and so does the analysis of variance. The fitting of an unweighted regression as described in the last section also assumes that the dependent variable has the same degree of variability (variance) for all levels of the independent variables.

Bartlett’s test offers a means of evaluating this assumption. Suppose that we have taken random samples in each of four groups and obtained variances \((s^2)\) of 84.2, 63.8, 88.6, and 72.1 based on samples of 9, 21, 6, and 11 units, respectively. We would like to know if these variances could have come from populations all having the same variance. The quantities needed for Bartlett’s test are tabulated here:

Group Variance \((s^2)\) \((n-1)\) Corrected sum of squares (SS) \(\frac{1}{n-1}\) \(\log s^2\) \((n-1)(\log s^2)\)
1 84.2 8 673.6 0.125 1.92531 15.40248
2 63.8 20 1276.0 0.050 1.80482 36.09640
3 88.6 5 443.0 0.200 1.94743 9.73715
4 72.1 10 721.0 0.100 1.85794 18.57940
\(k=4\) groups Sums 43 3113.6 0.475 79.81543

where:

\(k = \text {The number of groups }\)\((=4)\).
\(SS = \text {The corrected sum of squares}\) = \((\Sigma X^2-\frac {(\Sigma X)^2}{n})=(n-1)s^2\)

From this we compute the pooled within-group variance:

\[\bar s^2=\frac {\Sigma SS_i}{\Sigma(n_i-1)}=\frac {3113.6}{43}=72.4093\]

and \(\log \bar s^2=1.85979\)

Then the test of homogeneity is: \[\chi^2_{(k-1)df}=(2.3026)\left[(\log \bar s^2)\Sigma(n_i-1)-\Sigma(n_i-1)(\log s_i^2)\right]=(2.3026)[(1.85979)(43)-79.81543]=0.358\]

This value of \(\chi^2\) is now compared with the value of \(\chi^2\) in table 4 for the desired probability level. A value greater than that given in the table would lead us to reject the homogeneity assumption.

The \(\chi^2\) value given by the above equation is biased upward. If \(\chi^2\) is nonsignificant, the bias is not important. However, if the computed \(\chi^2\) is just a little above the threshold value for significance, a correction for bias should be applied. The correction is:
\[C=\frac{3(k-1)+\left[\Sigma\left(\frac{1}{n_i-1}\right)-\frac{1}{\Sigma(n_i-1)}\right]}{3(k-1)}=\frac{3(4-1)+\left(0.475-\frac{1}{43}\right)}{3(4-1)}=1.0502\]

The original form of this equation used natural logarithms in place of the common logarithms shown here. The natural log of any number is approximately 2.3026 times its common logarithm; hence the constant of 2.3026 in the equation. In computations, common logarithms are usually more convenient than natural logarithms.

The corrected value of \(\chi^2\) is then:

\[\text {Corrected } \chi^2=\frac {\text {Uncorrected }\chi^2}{C}=\frac {0.358}{1.0502}=0.341\]
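Bartlett's full computation (pooled variance, uncorrected chi-square, and the bias correction) can be sketched in Python. Using full-precision logarithms rather than the 5-place table values gives results agreeing with those above to about three decimals:

```python
import math

# Bartlett's test of homogeneity of variance for k groups.
variances = [84.2, 63.8, 88.6, 72.1]   # s_i^2 for each group
df = [8, 20, 5, 10]                    # n_i - 1 for each group
k = len(variances)

ss = [v * d for v, d in zip(variances, df)]   # corrected sums of squares
pooled = sum(ss) / sum(df)                    # s-bar^2 = 72.4093

# Uncorrected chi-square with k - 1 degrees of freedom;
# 2.3026 converts common logarithms to natural logarithms.
chi_sq = 2.3026 * (math.log10(pooled) * sum(df)
                   - sum(d * math.log10(v) for d, v in zip(df, variances)))
# chi_sq is about 0.3586 here; the text's 0.358 reflects 5-place log tables.

# Correction factor for the upward bias, and the corrected chi-square:
C = (3 * (k - 1) + sum(1 / d for d in df) - 1 / sum(df)) / (3 * (k - 1))
corrected = chi_sq / C   # C is about 1.0502, corrected chi_sq about 0.341
print(round(pooled, 4), round(C, 4))
```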