Saturday, 19 December 2009

Results and Discussions – Homogeneity Tests

The Monte Carlo simulation described in the previous post showed that

  • The Bartlett test is not as robust as the Levene tests against violation of the normality assumption.
  • All four Levene tests are less powerful than the Bartlett test.
  • The ANOVA F test provides generally poor control over both Type I and Type II error rates under a wide range of variance heterogeneity situations.
  • The Wald test (as proposed by Rayner, 1997) is comparable with the versions of the Levene test.

Variance ratio 1:1

The results where the samples have equal variances are listed below for the Log Normal (0,1), Exponential (1), Gamma (2,1), Chi 2(5) and Beta (6,1.5) distributions. The results are displayed for the Bartlett (B), Wald (W) and Levene tests (Lev-1, Lev-2, Lev-3 and Lev-4). The testing power for each of these distributions is graphed and displayed below.

Log Normal (0,1)


The Log Normal (0,1) distribution has a variance of 4.67 and a skewness of 6.19.

Exponential (1)


The Exponential (1) distribution has a variance of 1.00 and a skewness of 2.00.

Gamma (2,1)


The Gamma (2,1) distribution has a variance of 2.00 and a skewness of 1.41.

Chi 2(5)


The Chi 2(5) distribution has a variance of 10.0 and a skewness of 1.27.

Beta (6,1.5)


The Beta (6,1.5) distribution has a variance of 0.019 and a negative skewness of -0.921.
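The variance and skewness figures quoted for these five distributions can be checked against their theoretical moments. The following is a minimal sketch using SciPy. The mapping from the labels used in this post to SciPy's parameters (for example, treating Log Normal (0,1) as `lognorm` with shape 1 and scale 1, and Gamma (2,1) as shape 2 with scale 1) is an assumption, and the helper name `theoretical_moments` is purely illustrative.

```python
from scipy import stats

# Frozen SciPy distributions for the labels used in this post.
# ASSUMPTION: the parameter mapping below (e.g. lognorm shape s = sigma = 1,
# scale = exp(mu) = 1) is how the post's labels should be read.
distributions = {
    "Log Normal (0,1)": stats.lognorm(s=1.0, scale=1.0),
    "Exponential (1)": stats.expon(scale=1.0),
    "Gamma (2,1)": stats.gamma(a=2.0, scale=1.0),
    "Chi 2(5)": stats.chi2(df=5),
    "Beta (6,1.5)": stats.beta(a=6.0, b=1.5),
}


def theoretical_moments(dist):
    """Return (variance, skewness) of a frozen SciPy distribution."""
    variance, skewness = dist.stats(moments="vs")
    return float(variance), float(skewness)


for name, dist in distributions.items():
    variance, skewness = theoretical_moments(dist)
    print(f"{name:18s} variance={variance:.3f} skewness={skewness:.3f}")
```

Running this prints one line per distribution, which can be compared directly with the values stated above.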

In terms of power, the Bartlett test performed best. Lev-2 was more robust in controlling the Type I error rate, but it also displayed the least power among the six tests considered above. The Wald test is comparable with the other versions of the Levene test (Lev-1, Lev-3 and Lev-4).
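For readers who want to try these tests on their own data, here is a minimal sketch using SciPy's `bartlett` and `levene` functions on three illustrative Exponential (1) groups. Note the assumptions: the sample size, the number of groups and the seed are arbitrary. SciPy's `center` options (mean, median, trimmed mean) are used as stand-ins, since this post does not define how Lev-1 to Lev-4 map onto them. The Wald test is left out because SciPy has no built-in version of it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2009)  # arbitrary seed, for reproducibility only

# Three illustrative groups from Exponential(1): the true variances are equal.
groups = [rng.exponential(scale=1.0, size=25) for _ in range(3)]


def homogeneity_pvalues(samples):
    """Return p-values from Bartlett's test and three SciPy Levene variants.

    ASSUMPTION: the three `center` choices are SciPy's variants and are not
    claimed to correspond to Lev-1..Lev-4 as used in this post.
    """
    return {
        "bartlett": stats.bartlett(*samples).pvalue,
        "levene_mean": stats.levene(*samples, center="mean").pvalue,
        "levene_median": stats.levene(*samples, center="median").pvalue,
        "levene_trimmed": stats.levene(*samples, center="trimmed").pvalue,
    }


for name, p in homogeneity_pvalues(groups).items():
    print(f"{name:15s} p = {p:.4f}")
```

A small p-value here would be a false alarm, since all three groups share the same variance. Repeating the draw many times and counting the false alarms gives the empirical Type I error rate.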

Variance ratio 1:4

With heterogeneous variances, the results depend more strongly on the sample sizes than they do when the variances are homogeneous. In this instance, the power increases rapidly as the sample size increases for both equal and unequal samples.

Log Normal (0,1)


Exponential (1)


Gamma (2,1)


Chi 2(5)


Beta (6,1.5)


We see from the results above (and displayed in the tables included in the appendix) that the underlying distribution of the sample data plays a role in the selection of the ideal homogeneity test.


This paper compared the empirical type I error and power of several commonly used tests of variance homogeneity. These tests assess the level of homogeneity of within-group variances. The tests of homogeneity of variance that have been evaluated in this paper include:

· ANOVA-F test

· Bartlett's test

· the Scheffé-Box log-ANOVA test

· Box’s M test

· Cochran’s C test

· Levene’s tests (Lev-1, Lev-2, Lev-3 and Lev-4)

· Wald’s test

These tests have been evaluated in both their parametric and permutational forms. This paper has explored the conditions where the ANOVA-F and Levene test p-values are questionable at best and has evaluated the conditions where heterogeneity of variances is really a problem in these tests. These conditions are further analysed so that the usefulness of the various tests of homogeneity of variance for detecting heterogeneity can be compared across a variety of data distributions.

A preliminary simulation study confirmed that the ANOVA-F is extremely sensitive to heterogeneity of the variances. This was confirmed in situations where the assumption of normality was otherwise achieved. The ANOVA-F was sensitive to even low levels of heteroscedasticity. This caused inflated type I errors. This was particularly pronounced in the case where the variance of a single group was larger than the other groups that approximated each other.
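The inflation of the ANOVA-F type I error rate under heteroscedasticity can be illustrated with a short simulation. This is a sketch under assumed settings, not the preliminary study itself. It uses four normal groups of 10 observations with equal means, where one group has a standard deviation of 4 and the others have a standard deviation of 1. The group count, sample size, variance ratio and seed are illustrative choices.

```python
import numpy as np
from scipy import stats


def anova_type_one_rate(sds, n=10, n_sim=4000, alpha=0.05, seed=1):
    """Empirical type I error rate of the one-way ANOVA F test.

    All groups are normal with mean 0, so the null hypothesis of equal means
    is true and every rejection is a type I error. `sds` holds one standard
    deviation per group.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sim):
        groups = [rng.normal(loc=0.0, scale=sd, size=n) for sd in sds]
        if stats.f_oneway(*groups).pvalue < alpha:
            rejections += 1
    return rejections / n_sim


homogeneous = anova_type_one_rate((1.0, 1.0, 1.0, 1.0))
one_large = anova_type_one_rate((1.0, 1.0, 1.0, 4.0))  # one group with larger variance
print(f"equal variances:    type I = {homogeneous:.3f}")
print(f"one large variance: type I = {one_large:.3f}")
```

Comparing the two printed rates against the nominal 0.05 level shows how far the single large variance pushes the test away from its stated error rate.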

The use of non-normal data with heavy tails is problematic for many of the standard tests. It was demonstrated that the parametric tests are extremely sensitive to heteroscedasticity. The existence of a heavy tail (kurtosis) generally results in a loss of power in the various significance tests for heterogeneity of variance. The simulation was expanded to incorporate more extreme conditions (small sample sizes, non-normal distributions).

Both Cochran's test as well as the log-ANOVA test can be shown to display undue levels of sensitivity when even a solitary high variance value exists amongst the groups. It is also shown that these tests have low power when small to moderate sized data samples are tested.

Both Bartlett's and Box's tests performed well where the sample sizes were relatively large. The Bartlett test was not as robust as any of the four Levene tests when the normality assumption was violated (even where large datasets were used). At the same time, Bartlett's test displays a higher level of power than the Levene tests.

From these results, we can construct an algorithm that can aid in the determination of which homogeneity test should be used (see Table 1).


In Table 2, this process is extended to the selection of testing procedures.



ANOVA test results can be shown to be largely independent of sample size (within 5 to 100 observations per group). When the variances are homogeneous, ANOVA yields the correct type I error rate irrespective of the distribution. For normal data, the type I error rate is overstated when one of the variances is higher than the others. The problem is worse for non-normal distributions.

Effect of sample size

When the variances are not equal, the tests are influenced more by the sample size. As such, care should be applied to unequal samples. The Bartlett test provides the most power. Levene's second test (Lev-2) was more robust in controlling Type I error rate. This was countered by its tendency to display the lowest power. The Wald test is comparable with other versions of Levene tests.


Heterogeneity of variances is always a problem in ANOVA. Its effect is a type I error rate that ranges from somewhat to extremely inflated.

The most effective methods to test the homogeneity of variances are Bartlett's or Box's tests. Cochran's test should be avoided. The log-ANOVA test exhibits low power with small to moderate sample sizes. Bartlett's and Box's tests can be used if the samples are fairly large (n_i > 20).

Bartlett's and Cochran's tests have an uncontrolled risk of Type I errors when the populations are asymmetric and heavy-tailed. Levene-1 fares well in both robustness and power in a variety of situations (especially when the population mean is unknown). The type I error rate did become overstated to an unsatisfactory level in cases where the mean was unknown.

The assumptions of the parametric test are a linear relationship in the mean function, normal errors and correct specification of the form of the variance function. When the assumptions of the parametric test are violated, the nonparametric tests can be more powerful.

The Bartlett test is not as robust as Levene tests against the violation of the normality assumption. In general, the Levene tests are less powerful than the Bartlett test. The Wald test is more robust than the Bartlett test against the violation of the normality assumption. This test is poised between the Bartlett test and the four versions of the Levene tests in terms of Type I error rate and power. While Lev-2 (Levene’s 2nd test) was robust in controlling Type I error rate, it displayed a lower power than many other tests.

Bartlett’s test displayed the highest power in the majority of distributions. This test rejects the null hypothesis of equality of variances the greatest number of times. Bartlett’s test is also associated with insufficient control of the type I error rate. Levene’s second test (Lev-2) displays low power but is highly robust in the control of the type I error rate. The Wald test was shown to strike a balance between the Bartlett test and all versions of the Levene test.

When the datasets are asymmetric and heavy tailed, most tests for the homogeneity of variances perform poorly.
