The Wilcoxon–Mann–Whitney test under scrutiny

The Wilcoxon–Mann–Whitney (WMW) test is often used to compare the means or medians of two independent, possibly nonnormal distributions. For this problem, the true significance level of the large-sample approximation to the WMW test is known to be sensitive to differences in the shapes of the distributions. Based on a wide-ranging simulation study, our paper shows that this lack of robustness is more serious than is commonly thought. In particular, small differences in variances and moderate degrees of skewness can produce large deviations from the nominal type I error rate. The problem is further exacerbated when the two distributions have different degrees of skewness. Other rank-based methods, such as the Fligner–Policello (FP) test and the Brunner–Munzel (BM) test, perform similarly, although the BM test is generally better. By considering the WMW test as a two-sample t test on ranks, we explain the results by noting some undesirable properties of the rank transformation. In practice, the ranked samples should be examined and found to satisfy reasonable symmetry and variance homogeneity before the test results are interpreted.
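The view of the WMW test as a t test on ranks, and the kind of simulation the abstract describes, can be illustrated with a minimal sketch. It is not the paper's simulation design: the function names, sample sizes, standard deviations, replication count, and seed are all assumptions chosen for illustration, and the large-sample statistic is computed without tie or continuity corrections. The sketch draws two normal samples with equal means, so the null hypothesis of equal location holds, and counts how often the approximate two-sided WMW test rejects.

```python
import math
import random
from statistics import NormalDist, mean, variance


def midranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wmw_z(x, y):
    """Large-sample WMW statistic (no tie or continuity correction)."""
    m, n = len(x), len(y)
    r = midranks(list(x) + list(y))
    w = sum(r[:m])  # rank sum of the first sample
    mu = m * (m + n + 1) / 2
    sigma = math.sqrt(m * n * (m + n + 1) / 12)
    return (w - mu) / sigma


def rank_t(x, y):
    """Pooled-variance two-sample t statistic computed on the pooled ranks."""
    m, n = len(x), len(y)
    r = midranks(list(x) + list(y))
    rx, ry = r[:m], r[m:]
    sp2 = ((m - 1) * variance(rx) + (n - 1) * variance(ry)) / (m + n - 2)
    return (mean(rx) - mean(ry)) / math.sqrt(sp2 * (1 / m + 1 / n))


def rejection_rate(m, n, sd_x, sd_y, reps=2000, alpha=0.05, seed=1):
    """Estimate the rejection rate of the approximate WMW test when both
    samples are normal with mean 0 (illustrative settings, not the paper's)."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(reps):
        x = [rng.gauss(0, sd_x) for _ in range(m)]
        y = [rng.gauss(0, sd_y) for _ in range(n)]
        if abs(wmw_z(x, y)) > crit:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    print("equal spread:  ", rejection_rate(10, 40, 1.0, 1.0))
    print("unequal spread:", rejection_rate(10, 40, 4.0, 1.0))
```

When there are no ties, `rank_t` is a monotone function of `wmw_z`: with N = m + n, t² = (N − 2)z² / (N − 1 − z²). This is why the rank t test inherits the sensitivity of the ordinary t test to unequal variances among the ranked samples. In the second call the smaller sample has the larger spread, a configuration in which the estimated rejection rate would be expected to exceed the nominal 0.05.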
