Chapter 5 Chi-Square Tests

5.1 Test of Independence

Individuals are often classified according to two (or more) distinct systems. A tree can be classified as to species and at the same time according to whether or not it is infected with some disease. A milacre plot can be classified as to whether or not it is stocked with adequate reproduction and whether it is shaded or not shaded. Given such a cross-classification, it may be desirable to know whether the classification of an individual according to one system is independent of its classification by the other system. In the species-infection classification, for example, independence of species and infection would be interpreted to mean that there is no difference in infection rate between species (i.e., infection rate does not depend on species).

The hypothesis that two or more systems of classification are independent can be tested by chi-square. The procedure can be illustrated by a test of three termite repellents. A batch of 1,500 wooden stakes was divided at random into three groups of 500 each, and each group received a different termite-repellent treatment. The treated stakes were driven into the ground, with the treatment at any particular stake location being selected at random. Two years later the stakes were examined for termites. The number of stakes in each classification is shown in the following 2 by 3 (two rows and three columns) contingency table:

Classification Group I Group II Group III Subtotals
Attacked by termites 193 148 210 551
Not attacked 307 352 290 949
Subtotals 500 500 500 1500

If the data in the table are symbolized as shown below:

Classification I II III Subtotals
Attacked \(a_1\) \(a_2\) \(a_3\) \(A\)
Not attacked \(b_1\) \(b_2\) \(b_3\) \(B\)
Subtotals \(T_1\) \(T_2\) \(T_3\) \(G\)

The test of independence is made by computing:

\[\chi^2=\frac{1}{(A)(B)}\sum_{i=1}^{3}\frac{(a_iB-b_iA)^2}{T_i}\]

\[=\frac {1}{(551)(949)}[\frac {((193)(949)-(307)(551))^2}{500}+...+\frac {((210)(949)-(290)(551))^2}{500}]=17.66\]

This result is compared to the tabular value of \(\chi^2\) (table 4) with \((c-1)\) degrees of freedom, where \(c\) is the number of columns in the table of data. If the computed value exceeds the tabular value given in the 0.05 column, the difference among treatments is said to be significant at the 0.05 level (i.e., we reject the hypothesis that attack classification is independent of treatment classification).

In this example, the computed value of 17.66 (2 degrees of freedom) exceeds the tabular value in the 0.01 column, and so the difference in rate of attack among treatments is significant at the 1-percent level. Examination of the data suggests that this is primarily due to the lower rate of attack on the Group II stakes.
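As a check on the arithmetic, the 2 by 3 computation can be sketched in Python (standard library only; the counts are those in the contingency table above):

```python
# Chi-square test of independence for a 2 x c table, using the
# shortcut formula chi^2 = (1/AB) * sum_i (a_i*B - b_i*A)^2 / T_i.
a = [193, 148, 210]   # stakes attacked by termites, Groups I-III
b = [307, 352, 290]   # stakes not attacked
A = sum(a)            # subtotal attacked, 551
B = sum(b)            # subtotal not attacked, 949
T = [ai + bi for ai, bi in zip(a, b)]   # column subtotals, 500 each

chi_sq = sum((ai * B - bi * A) ** 2 / Ti
             for ai, bi, Ti in zip(a, b, T)) / (A * B)
print(round(chi_sq, 2))  # 17.66, with c - 1 = 2 degrees of freedom
```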

The r by c contingency table. The above example is a simple case of the chi-square test of independence in an \(r\) by \(c\) table (i.e., \(r\) rows and \(c\) columns). Thus, if a number of randomly selected forest stands were classified as to soil group and forest type the results might be as follows:

Soil group Forest Type I Forest Type II Forest Type III Subtotal
1 27 48 62 137
2 32 46 67 145
3 26 51 61 138
Subtotal 85 145 190 420

If the \(r\) by \(c\) table is represented in symbols:

Soil group Forest Type I Forest Type II Forest Type III Subtotal
1 \(a_{11}\) \(a_{12}\) \(a_{13}\) \(S_1\)
2 \(a_{21}\) \(a_{22}\) \(a_{23}\) \(S_2\)
3 \(a_{31}\) \(a_{32}\) \(a_{33}\) \(S_3\)
Subtotal \(T_1\) \(T_2\) \(T_3\) G

then the test of independence is: \[\chi^2=\frac{1}{G}\sum_{i,j}\frac{(Ga_{ij}-S_iT_j)^2}{S_iT_j},\text{ with }(r-1)(c-1)\text{ degrees of freedom}\]

In this example: \[\chi^2_{4 df}=\frac {1}{420}[\frac {((420)(27)-(137)(85))^2}{(85)(137)}+...+\frac {((420)(61)-(138)(190))^2}{(138)(190)}]=1.031\]
which is not significant at the 0.05 level. Thus, the test has failed to demonstrate any real association between forest types and soil groups.
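The general \(r\) by \(c\) computation is easy to script. A minimal Python sketch (standard library only), using the stand counts from the soil-group table above:

```python
# Chi-square test of independence for an r x c table:
# chi^2 = (1/G) * sum over cells of (G*a_ij - S_i*T_j)^2 / (S_i*T_j),
# with (r-1)(c-1) degrees of freedom.
table = [[27, 48, 62],
         [32, 46, 67],
         [26, 51, 61]]
S = [sum(row) for row in table]        # row subtotals S_i
T = [sum(col) for col in zip(*table)]  # column subtotals T_j
G = sum(S)                             # grand total, 420

chi_sq = sum((G * table[i][j] - S[i] * T[j]) ** 2 / (S[i] * T[j])
             for i in range(len(S)) for j in range(len(T))) / G
print(round(chi_sq, 3))  # 1.031, with 4 degrees of freedom
```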

The test of independence can be extended to more than two classification systems, but formulating meaningful hypotheses may be difficult.

5.2 Test of a Hypothesized Count

A geneticist hypothesized that, if a certain cross were made, the progeny would be of four types, in the proportions:

\[A=0.48,\quad B=0.32,\quad C=0.12,\quad D=0.08\]

The actual segregation of 1,225 progeny is shown below, along with the numbers expected according to the hypothesis.

Type A B C D Total
Number \((X_i)\) 542 401 164 118 1225
Expected \((m_i)\) 588 392 147 98 1225

Since the observed counts differ from those expected, we might wonder whether the hypothesis is false, or whether departures as large as this could occur strictly by chance.

The chi-square test is:

\[\chi^2=\sum_{i=1}^{k}\frac{(X_i-m_i)^2}{m_i},\text{ with }(k-1)\text{ degrees of freedom}\]

where:

\(k\)=The number of groups recognized
\(X_i\)=The observed count for the \(i^{th}\) group
\(m_i\)=The count expected in the \(i^{th}\) group if the hypothesis is true.

For the above data, \[\chi^2_{3df}=\frac {(542-588)^2}{588}+\frac {(401-392)^2}{392}+\frac {(164-147)^2}{147}+\frac {(118-98)^2}{98}=9.85\]

This value exceeds the tabular \(\chi^2\) with 3 degrees of freedom at the 0.05 level. Hence the hypothesis would be rejected (if the geneticist believed in testing at the 0.05 level).
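The same computation in Python (standard library only), with the hypothesized proportions used to generate the expected counts:

```python
# Chi-square test of a hypothesized count:
# chi^2 = sum_i (X_i - m_i)^2 / m_i, with k - 1 degrees of freedom.
observed = [542, 401, 164, 118]          # X_i, the observed segregation
proportions = [0.48, 0.32, 0.12, 0.08]   # hypothesized ratios A, B, C, D
total = sum(observed)                    # 1,225 progeny
expected = [p * total for p in proportions]   # m_i: 588, 392, 147, 98

chi_sq = sum((x - m) ** 2 / m for x, m in zip(observed, expected))
print(round(chi_sq, 2))  # 9.85, with 3 degrees of freedom
```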

5.3 Bartlett’s Test of Homogeneity of Variance

Many of the statistical methods described later are valid only if the variance is homogeneous. The t test of the following section assumes that the variance is the same for each group, and so does the analysis of variance. The fitting of an unweighted regression as described in the last section also assumes that the dependent variable has the same degree of variability (variance) for all levels of the independent variables.

Bartlett’s test offers a means of evaluating this assumption. Suppose that we have taken random samples in each of four groups and obtained variances \((s^2)\) of 84.2, 63.8, 88.6, and 72.1 based on samples of 9, 21, 6, and 11 units, respectively. We would like to know if these variances could have come from populations all having the same variance. The quantities needed for Bartlett’s test are tabulated here:

Group Variance \((s^2)\) \((n-1)\) Corrected sum of squares (SS) \(\frac{1}{n-1}\) \(\log s^2\) \((n-1)(\log s^2)\)
1 84.2 8 673.6 0.125 1.92531 15.40248
2 63.8 20 1276.0 0.050 1.80482 36.09640
3 88.6 5 443.0 0.200 1.94743 9.73715
4 72.1 10 721.0 0.100 1.85794 18.57940
\(k=4\) groups Sums 43 3113.6 0.475 79.81543

where:

\(k = \text {The number of groups }\)\((=4)\).
\(SS = \text {The corrected sum of squares}\) = \((\Sigma X^2-\frac {(\Sigma X)^2}{n})=(n-1)s^2\)

From this we compute the pooled within-group variance:

\[\bar s^2=\frac {\Sigma SS_i}{\Sigma(n_i-1)}=\frac {3113.6}{43}=72.4093\]

and \(\log \bar s^2=1.85979\)

Then the test of homogeneity is: \[\chi^2_{(k-1)df}=(2.3026)\left[(\log \bar s^2)\Sigma(n_i-1)-\Sigma(n_i-1)(\log s_i^2)\right]=(2.3026)[(1.85979)(43)-79.81543]=0.358\]

This value of \(\chi^2\) is now compared with the value of \(\chi^2\) in table 4 for the desired probability level. A value greater than that given in the table would lead us to reject the homogeneity assumption.

The \(\chi^2\) value given by the above equation is biased upward. If \(\chi^2\) is nonsignificant, the bias is not important. However, if the computed \(\chi^2\) is just a little above the threshold value for significance, a correction for bias should be applied. The correction is:
\[C=\frac{3(k-1)+\left[\Sigma\left(\frac{1}{n_i-1}\right)-\frac{1}{\Sigma(n_i-1)}\right]}{3(k-1)}=\frac{3(4-1)+\left(0.475-\frac{1}{43}\right)}{3(4-1)}=1.0502\]

The original form of this equation used natural logarithms in place of the common logarithms shown here. The natural log of any number is approximately 2.3026 times its common logarithm; hence the constant of 2.3026 in the equation. In computations, common logarithms are usually more convenient than natural logarithms.

The corrected value of \(\chi^2\) is then:

\[\text {Corrected } \chi^2=\frac {\text {Uncorrected }\chi^2}{C}=\frac {0.358}{1.0502}=0.341\]
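Bartlett's full computation (pooled variance, uncorrected chi-square, and the bias correction) can be sketched in Python. Using full-precision logarithms rather than the 5-place table values gives results agreeing with those above to about three decimals:

```python
import math

# Bartlett's test of homogeneity of variance for k groups.
variances = [84.2, 63.8, 88.6, 72.1]   # s_i^2 for each group
df = [8, 20, 5, 10]                    # n_i - 1 for each group
k = len(variances)

ss = [v * d for v, d in zip(variances, df)]   # corrected sums of squares
pooled = sum(ss) / sum(df)                    # s-bar^2 = 72.4093

# Uncorrected chi-square with k - 1 degrees of freedom;
# 2.3026 converts common logarithms to natural logarithms.
chi_sq = 2.3026 * (math.log10(pooled) * sum(df)
                   - sum(d * math.log10(v) for d, v in zip(df, variances)))
# chi_sq is about 0.3586 here; the text's 0.358 reflects 5-place log tables.

# Correction factor for the upward bias, and the corrected chi-square:
C = (3 * (k - 1) + sum(1 / d for d in df) - 1 / sum(df)) / (3 * (k - 1))
corrected = chi_sq / C   # C is about 1.0502, corrected chi_sq about 0.341
print(round(pooled, 4), round(C, 4))
```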