It is common and often necessary to start with a set of underlying assumptions in statistics. Generally speaking, the response variable is denoted by y and the variable being tested by x, which can be vector valued. This is commonly modelled deterministically:

y = f(x; θ)   (F 1.1)

In (F 1.1), θ is a vector of parameters and the function f is assumed to be known. To this model, random error or noise is commonly introduced (F 1.2). Due to the introduction of random error (ε) into y, the response y is subsequently treated as a random variable Y:

Y = f(x; θ) + ε   (F 1.2)

In (F 1.2), one assumption that is commonly made is that ε follows a white-noise pattern with an N(0, σ²) distribution. It is also generally assumed that x is measured with no (or negligible) error other than that expressed by ε.
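As a concrete illustration of (F 1.1) and (F 1.2), the following sketch simulates the model, assuming (purely for illustration) a linear mean function f(x; θ) = θ0 + θ1·x and σ = 1; the particular f and parameter values are not specified above and are our assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear mean function f(x; theta) = theta0 + theta1 * x
theta = np.array([2.0, 0.5])
x = np.linspace(0.0, 10.0, 200)

# Deterministic part (F 1.1): y = f(x; theta)
y_det = theta[0] + theta[1] * x

# White-noise error eps ~ N(0, sigma^2) turns y into a random variable Y (F 1.2)
sigma = 1.0
eps = rng.normal(0.0, sigma, size=x.shape)
Y = y_det + eps
```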
(F 1.3) relies on four basic assumptions:
(i) the errors ε are mutually independent,
(ii) the errors have zero mean,
(iii) the errors have a common variance σ², and
(iv) the errors are normally distributed.
For the classical model to hold true, the assumptions listed in (i)-(iv) need to hold at least approximately. When this is the case, the commonly deployed statistical tests may be used in stating inferences concerning the unknown parameters that define the random variable Y.
In the real world, data often fail to satisfy the primary assumptions of the classical model. Many fields (including but not limited to quantitative economics, sociology, and physics) collect data from non-normally distributed populations; these may be skewed, asymmetric, or otherwise non-normal. In these scenarios the assumptions underlying the classical tests are violated, and the conventional tests pertaining to the parameters of Y are no longer valid. It is then up to the experimenter either to develop the hypothesis in a different manner, or to modify the tests being conducted (and hence the hypothesis) in a manner that fulfils the assumptions required by the classical methods. Graybill (1976) notes a number of options that are available to the experimenter:
1. disregard any violation of the assumptions and continue with the analysis as if the assumptions were indeed satisfied,
2. choose an assumption other than that which is violated; in this event the experimenter should also use a valid procedure that addresses the assumptions being made,
3. produce a different model of the data that incorporates the critical aspects of the original model whilst meeting all of the assumptions of the classical model (e.g. this could be achieved through the use of a suitable transformation of the data or through filtering suspect data and outliers),
4. employ a distribution-free procedure that holds true even where the aforementioned assumptions are violated.
The use of data transforms is a well-established practice, and a suitable transform will commonly yield data that no longer violate the classical assumptions. Consequently, it is standard practice for experimenters to adopt the approach presented in point (3) and transform the data, a practice long documented in works such as those by Thoeni (1967) and Hoyle (1973).
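The Box & Cox (1964) power-transform family is the canonical example of such a transform; the sketch below applies it to hypothetical right-skewed (lognormal) data and shows the reduction in skewness (the data are illustrative, not from any study cited here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Right-skewed (lognormal) data that violate the normality assumption
skewed = rng.lognormal(mean=0.0, sigma=0.8, size=500)

# Box & Cox (1964): estimate the power transform that best normalises the data
transformed, lam = stats.boxcox(skewed)

print("estimated lambda:", round(lam, 3))
print("skewness before:", round(stats.skew(skewed), 3))
print("skewness after:", round(stats.skew(transformed), 3))
```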
Tukey (1958) proposed analysing the same data several times using an assorted range of transforms, chosen according to the rationale of the particular stage of data analysis and the related principles. Box & Cox (1964) and Draper & Hunter (1969) have shown that other methods of producing transformations which both satisfy the classical assumptions and produce workable results can be obtained. Kendall & Stuart (1966) note difficulties associated with this procedure along with the robustness of the linear model.
The analysis of variance (ANOVA) remains one of the most important statistical techniques in use for comparing dissimilar data classes with respect to their means. ANOVA relies heavily on the classical model assumptions. In particular, the approximation to normality (that is, that the model conforms to a standard normal distribution once transformed) relies on the homogeneity of variance. It is thus essential that the classical assumptions are met (or at least approximated), which adds a requirement to ensure the assumption of homogeneity of variances is satisfied; hence the necessity to test these values.
ANOVA is used in testing the equality of means of K populations given a set of samples, with

H0: μ1 = μ2 = … = μK   (F 1.4)

representing the null hypothesis to be tested, versus

H1: μi ≠ μj for at least one pair (i, j)

as the alternate hypothesis.
As noted above, there are several assumptions that must be satisfied before an F test for (F 1.4) may be conducted. These include:
- normally distributed variables,
- homogeneity of variances, and
- independence of observations.
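Assuming these three conditions hold, the F test of (F 1.4) can be run directly; a minimal sketch with hypothetical measurements from K = 3 groups (the numbers are invented for illustration):

```python
from scipy import stats

# Hypothetical measurements from K = 3 groups
g1 = [24.1, 25.3, 26.0, 24.8, 25.5]
g2 = [27.2, 26.8, 28.1, 27.5, 26.9]
g3 = [24.9, 25.1, 24.4, 25.8, 25.0]

# One-way ANOVA F test of H0: mu1 = mu2 = mu3
f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

Here g2 is visibly shifted upward, so the test rejects H0 at conventional significance levels.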
This paper is primarily concerned with the testing of homogeneity of variances. Bishop (1976) notes that the assumption of homogeneity of variances is frequently unsatisfied in ANOVA testing. Where the homogeneity of variances assumption has not been met, the sample means no longer share an equal expected standard error. As a result, the lack of knowledge about the true population variances introduces error into the evaluation of the F distribution, leading to severe effects on the inferences made about the sample. This is especially pronounced in samples of varying sizes, which Box (1954) noted as a serious issue.
It has been suggested that even equally sized samples may be non-robust (Rogan & Keselman, 1977), and that the ANOVA F test suffers even at low levels of variance heterogeneity. As a consequence, ANOVA requires highly homogeneous samples to be of use. Bishop (1976) states that the ANOVA F test provides meagre control over both Type I and Type II error rates, a problem that occurs across a wide range of variance-heterogeneity situations and ranges of values. Cochran & Cox (1957) [see also Brown & Forsythe, 1974; Wilcox et al., 1986] demonstrate the issues with using ANOVA tests where the homogeneity of variances is not assured. Hence, it is essential to test the homogeneity of variance using a robust test of homogeneity prior to using a test such as ANOVA.
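In practice such a pre-test is a one-liner; the sketch below runs the Bartlett and Levene tests (both discussed later) on simulated groups with deliberately unequal variances:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Three groups, the third with a deliberately inflated variance
a = rng.normal(0.0, 1.0, 50)
b = rng.normal(0.0, 1.0, 50)
c = rng.normal(0.0, 3.0, 50)

# Bartlett's test (known to be sensitive to non-normality)
bart_stat, bart_p = stats.bartlett(a, b, c)
# Levene's test (more robust to departures from normality)
lev_stat, lev_p = stats.levene(a, b, c)
print(f"Bartlett p = {bart_p:.4g}, Levene p = {lev_p:.4g}")
```

With a variance ratio of 9 and 50 observations per group, both tests reject homogeneity decisively.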
Numerous tests of homogeneity of variances have been proposed in the literature over the last century (Bartlett, 1937a, 1937b; Cochran, 1941, 1951; Box, 1953). Many of these tests have not withstood the test of time and have since disappeared from use. New alternatives appear from time to time, with various degrees of robustness to departures from normality.
Many authors claim that a test of homogeneity of variances is a prerequisite to analysis of variance. Zar (1999) asserts that these tests are so non-robust as to fail to be useful, and further claims that ANOVA is one of the more robust tests when homoscedasticity cannot be assured. Although it was claimed that this holds even under conditions of non-normality, it can be demonstrated that this does not in fact appear to be the case. Underwood (1997) asserts that the problems heterogeneity creates for ANOVA with size-matched samples only occur when the variances are strikingly different, and further asserts that ANOVA is not sensitive to the non-normality of the data (a condition that has major consequences for the use of most classical tests for homogeneity of variances). The Behrens-Fisher problem (Speed, 1987) arises in an analysis of variance when comparing sample means taken from normally distributed populations where equal variances cannot be assumed.
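In the two-sample case, the standard response to the Behrens-Fisher setting is Welch's test, which drops the equal-variance assumption; a short sketch on simulated normal samples with equal means but very different variances:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Normal samples with equal means but very different variances
x = rng.normal(10.0, 1.0, 30)
y = rng.normal(10.0, 5.0, 30)

# Welch's t-test: equal_var=False avoids pooling the unequal variances
t_welch, p_welch = stats.ttest_ind(x, y, equal_var=False)
print(f"Welch t = {t_welch:.3f}, p = {p_welch:.3f}")
```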
As noted, many tests for the homogeneity of variances have been developed that cover a wide range of conditions (see Bartlett, 1937; Bartlett & Kendall, 1946; Box & Andersen, 1955; Cochran, 1941; Hartley, 1940, 1950; Games et al., 1972; Levy, 1975a, 1975b; Layard, 1973; Conover et al., 1981; Gupta & Rathie, 1983; Tang & Gupta, 1987; Lim & Loh, 1996; Nelson, 2000; Wilcox, 2002; Gupta et al., 2004). The Bartlett and Levene tests remain the most popular of these tests.
The introduction of fast computers and computational techniques has made stochastic permutation testing practical in recent times, an option not available even 20 years ago. This type of test can be shown to assuage many of the problems associated with a lack of normality experienced with many parametric statistical tests. This allows us to examine the most common and popular tests for homogeneity of variance in the case of one-way ANOVA and to demonstrate that the Bartlett, F and Levene tests produce questionable p-values in a number of situations. We go further to present an alternative that does not suffer from these issues.
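A permutation test of spread of the kind alluded to here can be sketched in a few lines. The Levene-type statistic below (the between-group variance of the mean absolute deviations from each group median) is one illustrative choice, not the specific alternative test proposed later:

```python
import numpy as np

def perm_var_test(groups, n_perm=2000, seed=0):
    """Permutation test of equal spread: compare the observed
    between-group variance of mean absolute deviations against its
    distribution under random relabelling of the pooled data."""
    rng = np.random.default_rng(seed)
    data = np.concatenate(groups)
    sizes = [len(g) for g in groups]

    def stat(samples):
        # Mean absolute deviation from the group median, per group
        mads = [np.mean(np.abs(g - np.median(g))) for g in samples]
        return np.var(mads)

    observed = stat(groups)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(data)
        shuffled, start = [], 0
        for n in sizes:
            shuffled.append(perm[start:start + n])
            start += n
        if stat(shuffled) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)   # permutation p-value

rng = np.random.default_rng(4)
p = perm_var_test([rng.normal(0, 1, 40), rng.normal(0, 3, 40)])
print("permutation p-value:", p)
```

Because the null distribution is built from the data themselves, no normality assumption is needed for the p-value to be valid.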
We will show not only that the Bartlett, F and Levene tests are sensitive to the assumption of normality (Cochran & Cox, 1957; Levene, 1960; Conover et al., 1981; Weerahandi, 1995; Zar, 1999), but also that when normality is suspect, the p-values and sizes (significance levels) are not reliable. This means the weak optimality of likelihood ratio tests no longer holds (the power is inadequate).
The Bartlett and Levene tests are two of the more frequently deployed tests of the homogeneity of variance used when assessing one-way ANOVA. Box & Andersen (1955) show that the probability of a Type I error (α) depends on the kurtosis of the distribution. They further noted that leptokurtic (γ2 > 0) distributions exhibit a true value of α that is larger than the nominal α, while platykurtic (γ2 < 0) distributions display a true value of α that is smaller than the nominal α. A number of parametric tests exist as substitutes for ANOVA (Welch, 1951). In particular, tests of the equality of K (K ≥ 2) population means have been developed for data where the population variances are unequal (James, 1951; Brown & Forsythe, 1974; Wilcox, 1988; Alexander & Govern, 1994).
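The kurtosis effect described by Box & Andersen can be illustrated by Monte Carlo simulation: with all groups drawn from the same distribution, every rejection by Bartlett's test is a Type I error, and a leptokurtic parent (here the Laplace distribution, γ2 = 3) inflates the empirical rate well above the nominal 0.05. The sample sizes and replication counts below are arbitrary choices for the sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def type1_rate(sampler, reps=2000, n=20, k=3, alpha=0.05):
    """Empirical Type I error of Bartlett's test: all k groups share the
    same distribution, so every rejection is a false positive."""
    rejections = 0
    for _ in range(reps):
        groups = [sampler(n) for _ in range(k)]
        _, p = stats.bartlett(*groups)
        if p < alpha:
            rejections += 1
    return rejections / reps

normal_rate = type1_rate(lambda n: rng.normal(0, 1, n))
# Laplace is leptokurtic (gamma2 > 0): the true alpha exceeds the nominal 0.05
laplace_rate = type1_rate(lambda n: rng.laplace(0, 1, n))
print(f"normal: {normal_rate:.3f}, laplace: {laplace_rate:.3f}")
```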
Many of these substitutes for the ANOVA test fail in conditions where normality has not been achieved (Oshima et al., 1994; Keselman et al., 1995; Hsuing & Olejnik, 1996). In fact, skewness and bias remain problems with most of these alternatives (Wilcox, 1995). Others (Oshima & Algina, 1992b) have stated that statistical procedures need to consider the homogeneity and normality constraints with more care. Wei-ming (1999) notes that adjustments such as trimming, transformed statistics, and bootstrapping should be developed for the currently used tests (such as ANOVA F, O'Brien, Levene, etc.). In particular, the issues of heterogeneity in variance and other aspects of non-normality need to be addressed (see also Keselman et al., 2002). Conover et al. (1981) published a simulation study of 56 alternative procedures (including the classical tests), assessing both the robustness and power of these tests for the homogeneity of variance.
Winer et al. (1991) assert that all of the classical tests are susceptible to heterogeneity problems and other deviations from normality. It is suggested that new tests need to be developed with the ability to adapt to variations in the form of the distribution that is being analysed (Zar, 1999). In Chapter 3 of this paper, we will present the findings of a simulation study for both the classical tests for testing homogeneity of variances as well as introducing an alternative test procedure. The new test is anticipated to be an augmentation of the existing testing procedures. The new test is justified through the use of both simulation and actual data. This data exhibits a range of skew in the presented distributions. We demonstrate how the alternative tests exhibit a greater degree of robustness and power under conditions of heterogeneous variances.
In this paper, we analyse and compare several of the accepted tests for homogeneity of variances. In particular, we will concentrate on the following tests (which are defined below):
- 2-sided F Test (ANOVA F Test),
- Bartlett χ2 test,
- Levene Tests (Levene 1, Levene 2, Levene 3 and Levene 4),
- Jackknife Test,
- Box Test,
- Cochran Test,
- O’Brien (0.5, alternate Jackknife),
- BF and Modified Brown-Forsythe Tests, and
- Wald Test.
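Several of the Levene variants listed above differ mainly in how the deviations are centred before the ANOVA-style comparison; scipy's implementation exposes this through its center parameter ('mean' is Levene's original proposal, 'median' and 'trimmed' are the Brown-Forsythe style robust variants; mapping these onto "Levene 1-4" above is our assumption, pending the derivations in later posts):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Skewed (exponential) groups, the third with a larger scale
groups = [rng.exponential(s, 40) for s in (1.0, 1.0, 2.5)]

# The Levene family differs mainly in the centring of the deviations
for center in ("mean", "median", "trimmed"):
    w, p = stats.levene(*groups, center=center)
    print(f"center={center:8s} W = {w:.3f}, p = {p:.4f}")
```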
The derivations of these tests will be presented in following posts. A size and power study will also be published.
There are other tests in use, such as the Scheffé-Box log-ANOVA test (Martin & Games, 1977), that will also be introduced in future posts. The Scheffé-Box log-ANOVA test was developed using proposals by Box (1953) and Scheffé (1959). It first separates the variables of each group at random across several subgroups. The log-variance is then calculated for each subgroup. An F-type statistic is then used to contrast the among-group mean-square values with the within-group mean-squares of the log-variance values for each subgroup. Others, such as Efron (1982), have proposed using a transform to express the data in a form that more closely aligns with conditions of normality.
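The procedure just described can be sketched as follows. This is a rough illustration of the random-subgroup idea, not Martin & Games's exact algorithm; the subgroup count and the simulated data are arbitrary:

```python
import numpy as np
from scipy import stats

def scheffe_box(groups, n_sub=4, seed=0):
    """Sketch of a Scheffé-Box style log-ANOVA test: randomly split each
    group into subgroups, take the log of each subgroup's sample variance,
    and run a one-way ANOVA on those log-variances."""
    rng = np.random.default_rng(seed)
    log_vars = []
    for g in groups:
        g = rng.permutation(np.asarray(g))          # random allocation
        subs = np.array_split(g, n_sub)             # split into subgroups
        log_vars.append([np.log(np.var(s, ddof=1)) for s in subs])
    # F-type contrast of among-group vs within-group log-variance spread
    return stats.f_oneway(*log_vars)

rng = np.random.default_rng(7)
groups = [rng.normal(0, 1, 40), rng.normal(0, 1, 40), rng.normal(0, 4, 40)]
f_stat, p = scheffe_box(groups)
print(f"F = {f_stat:.3f}, p = {p:.4f}")
```

Taking logs of the subgroup variances stabilises their distribution, which is what makes an ordinary ANOVA on them defensible.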
 This disregard for the assumptions presumes that the procedure is sufficiently robust: a postulation that a failure to adequately account for the assumptions will result in only a small and insignificant deviation from the result that would occur had the assumptions been met.
 See also Welch (1951), Boneau (1960), Glass et al. (1972), Bishop & Dudewicz (1978), Bryk & Raudenbush (1987), Wilcox (1989), Oshima & Algina (1992a), Alexander & Govern (1994), Sokal & Rohlf (1995), Schneider & Penfield (1997), Wilcox (1997), Chen & Chen (1998), Luh & Guo (2000), Keselman et al. (2002), Mendes (2003), and Camdeviren & Mendes (2005). All of these authors have noted this same point, reinforcing the work of Box (1951).