Thursday, 17 December 2009

Size and Power Study

The statistic appropriate for use when assumptions are violated can be selected using the statistic’s robustness and power as selection criteria. Robustness is defined as the capacity of the statistic to control Type I error. As such, a test of variances is robust if it does not detect non-homogeneous variances when the original data are not normally distributed but the variances are in fact homogeneous.

Peechawanich (1992) noted that if the probability of a Type I error occurring exceeds the Cochran limit, the test is not capable of controlling the error rate. As such, a test can be considered robust when the calculated probability of a Type I error lies within the Cochran limit. The Cochran limits on the discrepancy of the empirical Type I error rate (α̂) from the nominal significance level (α) are set at the following values:

  •  At the 0.05 level of significance, 0.04 ≤ α̂ ≤ 0.06
  •  At the 0.01 level of significance, 0.007 ≤ α̂ ≤ 0.015
  •  Where τ is defined as the real probability of a Type I error occurring. This is also equal to the probability that H0 will be rejected when H0 is actually true.
  •  α̂ is the empirically calculated value of the probability of a Type I error occurring.
  •  α is the nominal level of significance. For this exercise, the values α = 0.01 and α = 0.05 have been used.
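A robustness check of this kind is easy to mechanise. The helper below is hypothetical (not part of the original study) and assumes the Cochran limits as quoted above: 0.04–0.06 at a nominal α of 0.05, and 0.007–0.015 at a nominal α of 0.01.

```python
# Hypothetical helper (not from the original study): classify a test as
# robust when its empirical Type I error rate falls within the Cochran
# limits assumed here.
COCHRAN_LIMITS = {0.05: (0.04, 0.06), 0.01: (0.007, 0.015)}

def is_robust(alpha_hat, nominal_alpha):
    """True when the empirical Type I error rate alpha_hat lies within
    the Cochran limit for the given nominal significance level."""
    low, high = COCHRAN_LIMITS[nominal_alpha]
    return low <= alpha_hat <= high
```

For example, a test whose simulated rejection rate under H0 is 0.055 at nominal α = 0.05 would be classed as robust, while one rejecting 8% of the time would not.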

A test’s power is the probability of rejecting the null hypothesis (H0) when it is false and should correctly be rejected. The power of a test is calculated by subtracting the probability of a Type II error[1] (β) from the maximum power value (1.0). As such, power is defined as:

Power = 1 − β

As such, the power of a test ranges from 0 (no power) to 1.0 (highly powerful).
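Power can be estimated by simulation: draw data under a specific alternative, count how often H0 is rejected, and the rejection proportion estimates 1 − β. The sketch below is illustrative only (the study itself used R and Matlab); a one-sample z-test with known σ is used purely because its p-value needs nothing beyond the standard library.

```python
import math
import random

def one_sample_z_pvalue(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value for H0: mean = mu0 with known sigma (a z-test,
    chosen only because its p-value needs nothing beyond math.erf)."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    # normal CDF via the error function
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def estimate_power(effect, n=25, alpha=0.05, n_sims=4000, seed=1):
    """Monte Carlo estimate of power = 1 - beta: the proportion of samples
    drawn under the alternative (true mean = effect) for which H0 is
    rejected at level alpha. With effect = 0 this estimates the size."""
    rng = random.Random(seed)
    rejections = sum(
        1 for _ in range(n_sims)
        if one_sample_z_pvalue([rng.gauss(effect, 1.0) for _ in range(n)]) < alpha
    )
    return rejections / n_sims
```

With effect = 0 the same function recovers the test size (approximately α), which is exactly the duality the size study below exploits.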

Power studies rely on four context variables:

(1) the expected “size” of the effect (such as the approximate incidence rate for a population of survival times),

(2) the sample size of the data being evaluated,

(3) the statistical confidence level (α) used in the experiment, and

(4) the variety of analysis that is used on the data.

In this chapter, a size and power study of the aforementioned tests is presented. A test size study has been conducted in order to estimate test sizes when sampling from the normal distribution and from a variety of non-normal distributions. These distributions display increasing levels of kurtosis as they move progressively further from the conditions of normality. The distributions are created using a simulation study technique based on the simulation features of R. The majority of these tests have been evaluated at the 5% significance level.
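The original study used R’s random generators; the Python sketch below shows the same idea of sampling from distributions that depart increasingly from normality, together with a sample excess-kurtosis estimate. The particular generators are illustrative choices, not the study’s actual list (which appears in Tables 1 to 5).

```python
import random

rng = random.Random(42)

# Illustrative generators, ordered roughly by increasing kurtosis.
# A Laplace variate is produced as the difference of two independent
# Exp(1) variates.
distributions = {
    "normal":  lambda n: [rng.gauss(0.0, 1.0) for _ in range(n)],
    "uniform": lambda n: [rng.uniform(-1.0, 1.0) for _ in range(n)],  # platykurtic
    "laplace": lambda n: [rng.expovariate(1.0) - rng.expovariate(1.0)
                          for _ in range(n)],                          # leptokurtic
}

def sample_kurtosis(x):
    """Excess kurtosis: the fourth standardised moment minus 3
    (0 for the normal distribution)."""
    n = len(x)
    m = sum(x) / n
    s2 = sum((v - m) ** 2 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    return m4 / (s2 ** 2) - 3.0
```

The uniform distribution has excess kurtosis −1.2 and the Laplace +3, bracketing the normal’s 0, which is the kind of progression away from normality the size study requires.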

Power studies aid in determining the relative effectiveness of tests in a range of situations. A good deal of material has been published concerning power studies based on simulations and retrospective data analysis (e.g., Goodman & Berlin, 1994; Hayes & Steidl, 1997; Reed & Blaustein, 1995; Thomas, 1997; Zumbo & Hubley, 1998).

Statistical power can be pictured as a fishing net. A low-power test (such as one based on a small sample) is like a large-mesh net: it catches only the largest effects and misses most of the rest, leading to acceptance of the null hypothesis when it is actually false. Tests can also be constructed that are too sensitive. Using larger sample sizes increases the probability that the postulated effect will be detected, and in the extreme, very large samples greatly increase the probability of obtaining a dataset whose randomly selected values track the population closely, leading to high power. This increase in power comes at a cost: many settings do not allow for the economical collection of extremely large datasets, and in destructive testing any sample that approaches the population also defeats the purpose of testing. Consequently, the selection of tests that remain powerful at low sample sizes is important. There is a trade-off between sample size and the size of the uncontrolled error, and choosing the test that provides the best statistical power can be essential.

There are four (4) possible results of any test:

(1) We conclude that H0 is true when H0 is true.

(2) We conclude that H0 is false when H0 is true.

(3) We conclude that H0 is true when H0 is false.

(4) We conclude that H0 is false when H0 is false.

Concluding either that H0 is true when H0 is true or that H0 is false when H0 is false can be seen as the desired outcome. Concluding that H0 is false when H0 is true is defined as a type I error (the erroneous rejection of H0). Concluding that H0 is true when H0 is false is defined as a type II error (the erroneous acceptance of H0). Type I and type II errors are undesirable.

The significance level (alpha) is the accepted risk of making a type I error[2]; the null hypothesis is rejected when the test’s p-value falls below it. The lower the alpha or beta values that are selected, the larger the sample size required.
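The four outcomes above can be captured in a tiny classifier, shown here purely as an illustration of the definitions:

```python
def outcome(h0_true, rejected):
    """Classify one of the four possible results of a test."""
    if h0_true and rejected:
        return "Type I error"       # erroneous rejection of H0, case (2)
    if not h0_true and not rejected:
        return "Type II error"      # erroneous acceptance of H0, case (3)
    return "correct decision"       # cases (1) and (4)
```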


A Monte Carlo study was conducted to evaluate many of the common tests of the homogeneity of variances with regard to Type I error stability and power. The Type I error stability study assessed the relationship between the observed rejection rates and the nominal rejection rates (α) when the homogeneity of variance hypothesis (2.1) was true. The power study evaluated the ability of each test to detect differences among sample variances when the homogeneity of variance hypothesis (2.1) was false.

Several R and Matlab functions were written to perform the Monte Carlo study. Examples of the programs and functions used to create the simulated datasets and run the tests are given in the Appendix. Each test used 100,000 replications. Random data with the required characteristics were generated within the function and transformed as required, and the various forms of the tests were computed. The random generation algorithms in R and Matlab provided the distributions used in this paper.
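The replication loop can be sketched in a few lines. The example below is an illustrative Python equivalent (the study itself used R and Matlab, and 100,000 replications rather than the 2,000 used here for brevity). Bartlett’s test is shown with k = 3 groups because the chi-square survival function with 2 degrees of freedom has the closed form exp(−T/2), keeping the sketch dependency-free.

```python
import math
import random
import statistics

def bartlett_pvalue(groups):
    """Bartlett's statistic for k = 3 groups, compared against a
    chi-square with 2 degrees of freedom, whose survival function is
    exactly exp(-T / 2)."""
    k = len(groups)                                        # assumed k = 3
    ns = [len(g) for g in groups]
    N = sum(ns)
    v = [statistics.variance(g) for g in groups]           # sample variances
    sp2 = sum((n - 1) * s for n, s in zip(ns, v)) / (N - k)
    C = 1 + (sum(1 / (n - 1) for n in ns) - 1 / (N - k)) / (3 * (k - 1))
    T = ((N - k) * math.log(sp2)
         - sum((n - 1) * math.log(s) for n, s in zip(ns, v))) / C
    return math.exp(-T / 2)

def rejection_rate(sigmas, n=20, alpha=0.05, reps=2000, seed=7):
    """Proportion of replications rejecting H0: equal variances. Under
    sigmas = (1, 1, 1) this estimates the test size; under unequal
    sigmas it estimates power."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(reps):
        groups = [[rng.gauss(0.0, s) for _ in range(n)] for s in sigmas]
        if bartlett_pvalue(groups) < alpha:
            rejected += 1
    return rejected / reps
```

Under equal variances the rejection rate should sit near the nominal 5%, while tripling one group’s standard deviation should be rejected almost always.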

For each dataset under test and each test of the homogeneity of variance being examined, the function wrote one line of results to a MySQL database. This database was created with separate tables for each of the test datasets, together with the SQL statements that contain the simulation parameters. A table with the rejection rates of the null hypothesis for each test (for a given number of simulations) was held in the same database as a separate table. The datasets were tested using the 95% and 99% confidence levels (alpha = 0.05 and alpha = 0.01) for this rejection rate. The output was sent to a separate table in the same MySQL database, which was used to produce the summary tables presented in the appendix to this paper.
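The storage step looks roughly like the following. The study wrote to MySQL; Python’s built-in sqlite3 is used here only so the sketch runs stand-alone, and the table and column names are illustrative rather than the study’s actual schema.

```python
import sqlite3

# In-memory database stands in for the study's MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE results (
                    test_name    TEXT,
                    distribution TEXT,
                    n            INTEGER,
                    alpha        REAL,
                    reject_rate  REAL)""")

# One line of results per (dataset, test) combination, as in the study.
conn.execute("INSERT INTO results VALUES (?, ?, ?, ?, ?)",
             ("bartlett", "normal", 20, 0.05, 0.051))
conn.commit()

# Summary tables are then produced by querying the stored rates.
rows = conn.execute("SELECT reject_rate FROM results "
                    "WHERE test_name = 'bartlett'").fetchall()
```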

Data were generated from a large number of different distributions (displayed in the results, Tables 1 to 5). These distributions were selected to align with those presented in the existing literature (Conover et al., 1981) where multivariate tests have been conducted (ni where i = 3, 4, 5, 6 or 8). These tests used ‘k-sample’ tests for homogeneity where k > 2. The earlier papers were also constrained in the sizes of their simulations, with some generating as few as 1,000 data points.

The standard errors of all entries in Tables 1 to 5 are under 0.015.

The process used in this simulation study is as follows.

1) Select the sample size (ni), the number of simulations (100,000), and the significance level α (0.01 and 0.05).

2) Generate independent random samples from the selected distributions.

3) Compute the test statistic for the simulated data for the various homogeneity tests (e.g. Bartlett, Wald, Levene [Lev-1 to Lev-4], Cochran, etc.).

4) Repeat steps (2) to (3) 100,000 times, count the cases where the computed test statistic is greater than the corresponding critical value, and compute the proportion of rejections over the 100,000 repetitions.

5) The proportion of rejections in step (4) is the estimated test size when the data are simulated under equality of variances, and the estimated power otherwise.

6) The entire process, steps (1) to (5), is repeated for varying sample sizes (ni) and varying degrees of heterogeneity.
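The steps above can be sketched end-to-end as follows. This is an illustrative Python version (the study used R and Matlab with 100,000 repetitions): Cochran’s C is used as the statistic, and its critical value is obtained here by simulating the null distribution and taking its (1 − α) quantile, which is one way to realise the “computed test statistic greater than the corresponding critical value” comparison of step (4).

```python
import random
import statistics

def cochran_C(groups):
    """Cochran's C statistic: the largest sample variance as a
    proportion of the sum of the sample variances."""
    v = [statistics.variance(g) for g in groups]
    return max(v) / sum(v)

def size_and_power(n=10, k=3, alpha=0.05, reps=3000, seed=3):
    """Steps (1) to (5): simulate the null distribution of the statistic,
    take its (1 - alpha) quantile as the critical value, then estimate
    the test size (equal variances) and the power (one group's standard
    deviation inflated threefold). Step (6) would loop this over n and
    over different degrees of heterogeneity."""
    rng = random.Random(seed)

    def draw(sigmas):
        # step (2): independent random samples from the chosen setting
        return [[rng.gauss(0.0, s) for _ in range(n)] for s in sigmas]

    # steps (3)-(4) under H0: null distribution and critical value
    null_stats = sorted(cochran_C(draw([1.0] * k)) for _ in range(reps))
    crit = null_stats[int((1 - alpha) * reps)]
    # step (5): rejection proportions give size and power
    size = sum(cochran_C(draw([1.0] * k)) > crit for _ in range(reps)) / reps
    power = sum(cochran_C(draw([1.0] * (k - 1) + [3.0])) > crit
                for _ in range(reps)) / reps
    return size, power
```

The estimated size should land near the nominal α by construction, while the power against a threefold standard-deviation inflation is substantially higher.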

[1] A Type II error is defined as incorrectly accepting (a failure to reject) the null hypothesis (H0) when the null hypothesis is indeed wrong.

[2] The type I error is also designated as "alpha".
