Effect of Nonnormality on Test Statistics for One-Way Independent Groups Designs

The data obtained from one-way independent groups designs are typically nonnormal in form and are rarely equally variable across treatment populations (i.e., population variances are heterogeneous). Consequently, the classical test statistic that is used to assess statistical significance [i.e., the analysis of variance (ANOVA) F-test] typically provides invalid results (e.g., too many Type I errors, reduced power). For this reason, there has been considerable interest in finding a test statistic that is appropriate under conditions of nonnormality and variance heterogeneity. Previously recommended procedures for analyzing such data include the James (1951) test and the Welch (1951) test, the latter applied either to the usual least squares estimators of central tendency and variability or to robust estimators, i.e., trimmed means and Winsorized variances. A new statistic proposed by Krishnamoorthy, Lu and Mathew (2007), intended to deal with heterogeneous variances, though not nonnormality, uses a parametric bootstrap procedure. In their investigation of the parametric bootstrap test, the authors examined its operating characteristics under limited conditions and did not compare it to the Welch test based on robust estimators. Thus, we investigated how the parametric bootstrap procedure, and a modified parametric bootstrap procedure based on trimmed means, perform relative to previously recommended procedures when data are nonnormal and heterogeneous. The results indicated that the tests based on trimmed means offer the best Type I error control and power when variances are unequal and at least some of the distribution shapes are nonnormal.

A common question in the behavioural sciences is whether treatment groups differ on an outcome variable. 
For example, a researcher may be interested in determining if eating disorder symptomatology (e.g., obsession with weight) varies across different cultural backgrounds. The procedure that is most popular for analyzing data from one-way independent groups designs is the analysis of variance (ANOVA) F-test. The ANOVA can be a valid and powerful test for identifying treatment effects; however, when the validity assumptions underlying the test are violated, the results from the test are typically unreliable and invalid. One mathematical validity assumption of the test (i.e., a condition that was stipulated in order to derive the test statistic) is that the distribution of each population is normal in form. Although this is assumed by most researchers, it is very often not the case (Micceri, 1989). Nonnormality can have deleterious effects on the F-test, the predominant one being a lack of sensitivity to detect treatment effects (Wilcox, 1997). As well, there is an increased risk that null effects will be falsely declared statistically significant (i.e., an elevated probability of committing a Type I error), especially when sample sizes are small. A second mathematical restriction that was adopted when deriving the test statistic was that the population variances be equal. It is well known that unequal variances are the norm, rather than the exception, with behavioral science data (Erceg-Hurn & Mirosevich, 2008; Golinski & Cribbie, 2009; Grissom, 2000; Keselman et al., 1998), with largest-to-smallest group variance ratios greater than ten not uncommon (Grissom, 2000; Wilcox, 1987). Moreover, unequal variances can have drastic effects on the reliability and validity of the F-test, especially when group sample sizes are also unequal (Glass, Peckham & Sanders, 1972; Harwell, Rubenstein, Hayes & Olds, 1992; Kohr & Games, 1974; Scheffé, 1959). 
When distributions are nonnormal and variances are unequal, the empirical probability of a Type I or Type II error for the F-test can deviate even more substantially from the nominal levels than when either assumption is independently violated (Glass, Peckham & Sanders, 1972; Luh & Guo, 2001). Several procedures have been recommended for analyzing the data from one-way independent groups designs when distributions are nonnormal and variances are unequal (e.g., Brunner, Dette, & Munk, 1997; Cribbie, Wilcox, Bewell & Keselman, 2007; Wilcox & Keselman, 2003). Currently, the most recommended approaches involve utilizing the James (1951) or Welch (1951) heteroscedastic F-tests (based on the usual least squares estimators), or the Welch heteroscedastic F-test with trimmed means and Winsorized variances. Several studies have demonstrated that the original James and Welch procedures are generally robust (with respect to Type I errors and power) when group variances and sample sizes are extremely unequal (e.g., Kohr & Games, 1974; Krishnamoorthy, Lu & Mathew, 2007), and further that these tests are robust to unequal variances and nonnormal data, as long as the nonnormality is mild to moderate (Algina, Oshima, & Lin, 1994). The Welch test with trimmed means and Winsorized variances has also been shown to provide excellent Type I error control and power even under extreme violations of the normality and variance equality assumptions (Keselman, Wilcox, Othman & Fradette, 2002). An important condition of nonnormality that has received very little attention in the methodological literature is the case of dissimilar distribution shapes across treatment groups. It is not uncommon, for instance, for behavioral science researchers to encounter one group with an approximately normal distribution and another group with a skewed distribution. 
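To make the Welch heteroscedastic test with trimmed means and Winsorized variances more concrete, the following is a minimal Python sketch using only the standard library. It follows the commonly presented form of the statistic and is not taken from this paper. The function names, the symmetric 20% trimming default, and the example data are illustrative assumptions. With `trim=0` the computation reduces to the usual Welch (1951) statistic on least squares means and variances.

```python
import math
from statistics import mean, variance


def trimmed_mean_and_winsorized_var(x, trim=0.2):
    """Return (trimmed mean, Winsorized variance, effective n) for one group."""
    xs = sorted(x)
    n = len(xs)
    g = math.floor(trim * n)      # observations trimmed from each tail
    h = n - 2 * g                 # effective sample size after trimming
    t_mean = mean(xs[g:n - g])
    # Winsorize: pull each tail in to the nearest retained value.
    wins = [min(max(v, xs[g]), xs[n - g - 1]) for v in xs]
    return t_mean, variance(wins), h


def welch_trimmed(groups, trim=0.2):
    """Welch-type heteroscedastic statistic on trimmed means.

    Returns (F, df1, df2); F is referred to an F(df1, df2) distribution.
    Assumes at least two groups, each with nonzero Winsorized variance.
    """
    J = len(groups)
    stats = [trimmed_mean_and_winsorized_var(g, trim) for g in groups]
    # Estimated squared standard error of each trimmed mean.
    d = [(len(g) - 1) * s2 / (h * (h - 1)) for g, (_, s2, h) in zip(groups, stats)]
    w = [1.0 / dj for dj in d]
    U = sum(w)
    grand = sum(wj * m for wj, (m, _, _) in zip(w, stats)) / U
    A = sum(wj * (m - grand) ** 2 for wj, (m, _, _) in zip(w, stats)) / (J - 1)
    lam = sum((1 - wj / U) ** 2 / (h - 1) for wj, (_, _, h) in zip(w, stats))
    B = 2 * (J - 2) / (J ** 2 - 1) * lam
    return A / (1 + B), J - 1, (J ** 2 - 1) / (3 * lam)


if __name__ == "__main__":
    # Hypothetical data: the first group contains one extreme score.
    g1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
    g2 = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
    g3 = [6.0, 7.0, 9.0, 10.0, 11.0, 12.0, 13.0, 15.0, 16.0, 18.0]
    print(welch_trimmed([g1, g2, g3], trim=0.2))
    print(welch_trimmed([g1, g2, g3], trim=0.0))
```

The sketch returns the statistic and its degrees of freedom only. Obtaining a p-value requires an F distribution function, which the standard library does not provide.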
For example, Leentjens, Wielaert, van Harskamp and Wilmink (1998) found that scores on many measures of nonverbal aspects of language (i.e., prosody) were normally distributed in control groups, but were extremely skewed in schizophrenic patients. Wilcox (2005) notes that skewed distributions in general are not as problematic as when groups have different amounts of skewness. Indeed, Tiku (1964) explored situations where skew differed between groups and found that Type I and Type II errors were adversely affected when groups are skewed in opposite directions, especially with smaller sample sizes. It is important to point out that when distribution shapes are dissimilar, isolating the specific nature of the differences in the distributions is an important part of the data analysis (and comparisons of central tendencies may be less informative). For example, when distribution shapes are dissimilar, alternative descriptive statistics, such as the specific quantiles (e.g., 10th, 25th, 75th, 90th) for each distribution, can be useful in understanding differences between the distributions. Further, if one suspects that distribution shapes might be dissimilar, it might be fruitful to explicitly test for differences in the distributions using a runs test, such as the Wald-Wolfowitz, or a test of a common distribution, such as the Kolmogorov-Smirnov or Cramer-von Mises tests (see Sprent & Smeeton, 2001, pages 185-188). For example, in the Leentjens et al. (1998) study described above, the goal of the researchers was to compare the central tendencies of the groups, although specific tests used to isolate differences in the shapes of the distributions may have also been informative. Thus, when distribution shapes differ, researchers may be interested in exploring differences in the central tendencies, exploring the nature of the distributional differences, or both. 
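As a rough illustration of the descriptive comparison suggested above, the Python sketch below computes selected quantiles for each group and the two-sample Kolmogorov-Smirnov statistic (the largest vertical gap between the two empirical distribution functions). The function names, the linear-interpolation quantile rule, and the data are assumptions made for illustration. No p-value is computed.

```python
from bisect import bisect_right


def quantile(x, p):
    """Sample quantile by linear interpolation between order statistics."""
    xs = sorted(x)
    pos = p * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the ECDFs."""
    xs, ys = sorted(x), sorted(y)
    # Both ECDFs are step functions, so the maximum gap occurs at a data point.
    return max(
        abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
        for v in xs + ys
    )


if __name__ == "__main__":
    # Hypothetical scores: roughly symmetric controls, right-skewed patients.
    controls = [8, 9, 10, 10, 11, 11, 12, 12, 13, 14]
    patients = [8, 8, 9, 9, 9, 10, 11, 14, 19, 30]
    for p in (0.10, 0.25, 0.75, 0.90):
        print(p, quantile(controls, p), quantile(patients, p))
    print("KS statistic:", ks_statistic(controls, patients))
```

In practice a library routine such as `scipy.stats.ks_2samp` would normally be used to obtain the accompanying p-value.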
Since the underlying goal of most studies in psychology that involve comparing groups is to compare the central tendencies, this study addresses the important question of how available test statistics perform under these conditions. The parametric bootstrap procedure proposed by Krishnamoorthy et al. (2007) is a relatively new statistic for comparing the means of independent groups when the variances of the groups are unequal. This test involves generating sample statistics from parametric models, where the parameters in the model are replaced by their estimates (see below for details regarding the parametric bootstrap procedure). This procedure was found by the authors to provide a better balance of Type I error control and power than the original Welch (1951) procedure, especially when sample sizes were small and the number of groups was large. There are, however, important questions that were not explored by Krishnamoorthy et al. (2007). For example, how well will the Krishnamoorthy et al. procedure perform (with respect to controlling Type I and II error rates) when distribution shapes are nonnormal? This question is important because, as discussed earlier, distributions in the behavioural sciences are rarely normal. An important point related to this issue is how to distinguish between a normally distributed variable and a nonnormally distributed variable. Although numerous test statistics have been proposed for detecting deviations from normality (e.g., Chen & Shapiro, 1995; D’Agostino, 1971; Shapiro & Wilk, 1965), it is also important to consider that: 1) the performance of tests of normality is greatly affected by sample size, the form of nonnormality, etc. 
(Seier, 2002); 2) graphical methods (e.g., histograms, boxplots, normal quantile plots) can sometimes be as informative as tests of normality for detecting deviations from normality (Holgersson, 2006); and most importantly, 3) the power of many traditional parametric tests can be severely affected by even slight deviations from normality (Wilcox, 2005). Therefore, even though there is subjectivity in deciding whether or not a distribution is normal, it is important that we are aware of how various test statistics perform under different degrees of nonnormality in order to be able to make informed recommendations regarding the appr
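The parametric bootstrap idea described above (generating sample statistics from a parametric model whose parameters are replaced by their estimates) can be sketched in Python as follows. This is an illustrative reading, not the authors' code. It assumes the test statistic is the precision-weighted between-group sum of squares, assumes normal sampling models for each group, and uses hypothetical function names, a hypothetical default number of replicates, and a fixed seed.

```python
import random
from statistics import mean, variance


def weighted_between_stat(ns, means, variances):
    """Squared deviations of group means from their precision-weighted
    grand mean, each weighted by n_i / s_i^2, summed over groups."""
    w = [n / v for n, v in zip(ns, variances)]
    grand = sum(wi * m for wi, m in zip(w, means)) / sum(w)
    return sum(wi * (m - grand) ** 2 for wi, m in zip(w, means))


def parametric_bootstrap_pvalue(groups, n_boot=5000, seed=0):
    """Approximate p-value for equality of means under unequal variances."""
    rng = random.Random(seed)
    ns = [len(g) for g in groups]
    s2 = [variance(g) for g in groups]
    observed = weighted_between_stat(ns, [mean(g) for g in groups], s2)
    exceed = 0
    for _ in range(n_boot):
        # Under the null the common mean can be set to 0, because the
        # statistic does not change when a constant is added to every group.
        b_means = [rng.gauss(0.0, (v / n) ** 0.5) for n, v in zip(ns, s2)]
        # Bootstrap variances: s_i^2 * chi-square(n_i - 1) / (n_i - 1),
        # using chi-square(k) = Gamma(shape=k/2, scale=2).
        b_vars = [
            v * rng.gammavariate((n - 1) / 2, 2.0) / (n - 1)
            for n, v in zip(ns, s2)
        ]
        if weighted_between_stat(ns, b_means, b_vars) > observed:
            exceed += 1
    return exceed / n_boot


if __name__ == "__main__":
    # Hypothetical data with unequal variances and unequal sample sizes.
    g1 = [4.1, 5.0, 5.2, 5.9, 6.3]
    g2 = [2.0, 4.5, 7.1, 9.8, 12.4, 15.0, 3.3]
    g3 = [6.0, 6.2, 6.5, 6.9, 7.4, 7.7]
    print(parametric_bootstrap_pvalue([g1, g2, g3]))
```

Only the sample sizes, means, and variances enter the procedure, so the bootstrap replicates are drawn for the summary statistics directly instead of for individual observations.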

[1]  J. Hess,et al.  Analysis of variance , 2018, Transfusion.

[2]  R Core Team,et al.  R: A language and environment for statistical computing. , 2014 .

[3]  David C. Hoaglin,et al.  Summarizing Shape Numerically: The g‐and‐h Distributions , 2011 .

[4]  Eugenia Stoimenova,et al.  Applied Nonparametric Statistical Methods , 2010 .

[5]  R. Cribbie,et al.  The expanding role of quantitative methodologists in advancing psychology. , 2009 .

[6]  David M Erceg-Hurn,et al.  Modern robust statistical methods: an easy way to maximize the accuracy and power of your research. , 2008, The American psychologist.

[7]  Robert A. Cribbie,et al.  Tests for Treatment Group Equality When Data are Nonnormal and Heteroscedastic , 2007 .

[8]  K. Krishnamoorthy,et al.  A parametric bootstrap approach for ANOVA with unequal variances: Fixed and random models , 2007, Comput. Stat. Data Anal..

[9]  S. Shapiro,et al.  An analysis of variance test for normality (complete samples) , 2007 .

[10]  H. E. Holgersson,et al.  A graphical method for assessing multivariate normality , 2006, Comput. Stat..

[11]  Julia Kastner,et al.  Introduction to Robust Estimation and Hypothesis Testing , 2005 .

[12]  James Algina,et al.  An alternative to Cohen's standardized mean difference effect size: a robust parameter and confidence interval in the two independent groups case. , 2005, Psychological methods.

[13]  E. Ziegel Introduction to Robust Estimation and Hypothesis Testing (2nd ed.) , 2005 .

[14]  H. Keselman,et al.  Modern robust data analysis methods: measures of central tendency. , 2003, Psychological methods.

[15]  Rand R. Wilcox,et al.  Trimming, Transforming Statistics, And Bootstrapping: Circumventing the Biasing Effects Of Heterescedasticity And Nonnormality , 2002 .

[16]  W. Luh,et al.  Using Johnson's transformation and robust estimators with heteroscedastic test statistics: an examination of the effects of non-normality and heterogeneity in the non-orthogonal two-way ANOVA design. , 2001, The British journal of mathematical and statistical psychology.

[17]  Rand R. Wilcox,et al.  Testing Repeated Measures Hypotheses When Covariance Matrices are Heterogeneous: Revisiting the Robustness of the Welch-James Test Again , 2000 .

[18]  R. Serlin,et al.  Testing for robustness in Monte Carlo studies. , 2000, Psychological methods.

[19]  R. Grissom,et al.  Heterogeneity of variance in clinical data. , 2000, Journal of consulting and clinical psychology.

[20]  Alfio Marazzi,et al.  The truncated mean of an asymmetric distribution , 1999 .

[21]  Carl J. Huberty,et al.  Statistical Practices of Educational Researchers: An Analysis of their ANOVA, MANOVA, and ANCOVA Analyses , 1998 .

[22]  Rand R. Wilcox,et al.  The goals and strategies of robust methods , 1998 .

[23]  Rand R. Wilcox,et al.  How many discoveries have been lost by ignoring modern statistical methods , 1998 .

[24]  A. Leentjens,et al.  Disturbances of affective prosody in patients with schizophrenia; a cross sectional study , 1998, Journal of neurology, neurosurgery, and psychiatry.

[25]  Holger Dette,et al.  Box-Type Approximations in Nonparametric Factorial Designs , 1997 .

[26]  R. Wilcox A Bootstrap Modification of the Alexander-Govern ANOVA Method, Plus Comments on Comparing Trimmed Means , 1997 .

[27]  S. Shapiro,et al.  An alternative test for normality based on normalized spacings , 1995 .

[28]  Rand R. Wilcox,et al.  ANOVA: The practical importance of heteroscedastic methods, using trimmed means versus means, and designing simulation studies , 1995 .

[29]  T. C. Oshima,et al.  Type I Error Rates for Welch’s Test and James’s Second-Order Test Under Nonnormality and Inequality of Variance When There Are Two Groups , 1994 .

[30]  Rand R. Wilcox,et al.  Some Results on the Tukey-Mclaughlin and Yuen Methods for Trimmed Means when Distributions are Skewed , 1994 .

[31]  Michael R. Harwell,et al.  Summarizing Monte Carlo Results in Methodological Research: The One- and Two-Factor Fixed Effects ANOVA Cases , 1992 .

[32]  S. Sheather,et al.  Robust Estimation and Testing , 1990 .

[33]  T. Micceri The unicorn, the normal curve, and other improbable creatures. , 1989 .

[34]  R. Wilcox A Heteroscedastic ANOVA Procedure With Specified Power , 1987 .

[35]  Jürg Hüsler On the two-sample adaptive distribution-free test , 1987 .

[36]  John Law,et al.  Robust Statistics—The Approach Based on Influence Functions , 1986 .

[37]  Paul A. Games,et al.  Robustness of the Analysis of Variance, the Welch Procedure and a Box Procedure to Heterogeneous Variances , 1974 .

[38]  G. Glass,et al.  Consequences of Failure to Meet Assumptions Underlying the Fixed Effects Analyses of Variance and Covariance , 1972 .

[39]  R. D'Agostino An omnibus test of normality for moderate and large size samples , 1971 .

[40]  M. Tiku Approximating the general non-normal variance-ratio sampling distributions , 1964 .

[41]  B. L. Welch ON THE COMPARISON OF SEVERAL MEAN VALUES: AN ALTERNATIVE APPROACH , 1951 .