Robust distances for outlier-free goodness-of-fit testing

Robust distances are used mainly to detect multivariate outliers. The cut-off values for formal outlier testing are derived under the assumption that the "good" part of the data comes from a multivariate normal population. Robust distances also carry valuable information about the units not declared to be outliers and, under mild regularity conditions, can be used to test the postulated hypothesis that the uncontaminated data are multivariate normal. This approach is not influenced by outliers and thus provides a robust alternative to classical tests for multivariate normality that rely on Mahalanobis distances. A major advantage of the proposed procedure is that it accounts for the effect of trimming outliers in several ways. First, it is shown that stochastic trimming is an important ingredient for obtaining a reliable estimate of the number of "good" observations. Second, trimming must be allowed for in the empirical distribution of the robust distances when comparing them with their nominal distribution. Finally, alternative trimming rules can be obtained by controlling alternative error rates, such as the False Discovery Rate. Numerical evidence from simulated and real data shows that the proposed method performs well in a variety of situations of practical interest. It is thus a valuable companion to existing outlier detection tools for the robust analysis of complex multivariate data structures.
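
To make the idea of a trimming rule that controls the False Discovery Rate more concrete, the following minimal sketch applies the Benjamini-Hochberg step-up procedure to p-values attached to squared robust distances. It is an illustration under stated assumptions, not the procedure of the paper: the function names and the example distances are hypothetical, the data are taken to be bivariate, and the p-values use the asymptotic chi-square reference distribution with 2 degrees of freedom (whose survival function is exp(-x/2)) rather than a finite-sample distribution for robust distances.

```python
import math
from typing import List


def chi2_sf_2df(x: float) -> float:
    """Survival function of a chi-square variable with 2 degrees of freedom.

    Assumption: bivariate data and the asymptotic chi-square approximation
    for squared distances; for 2 degrees of freedom P(X > x) = exp(-x / 2).
    """
    return math.exp(-x / 2.0)


def benjamini_hochberg(p_values: List[float], q: float = 0.05) -> List[bool]:
    """Return one flag per p-value: True if rejected at FDR level q."""
    n = len(p_values)
    if n == 0:
        return []
    # Indices sorted by ascending p-value.
    order = sorted(range(n), key=lambda i: p_values[i])
    # Largest 1-based rank k such that p_(k) <= k * q / n.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / n:
            k_max = rank
    # Reject the k_max smallest p-values.
    rejected = [False] * n
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical squared robust distances for six bivariate observations.
squared_distances = [0.5, 1.2, 2.0, 3.1, 25.0, 30.0]
p_values = [chi2_sf_2df(d2) for d2 in squared_distances]
outlier_flags = benjamini_hochberg(p_values, q=0.05)
print(outlier_flags)
```

In this sketch the flagged units are those trimmed as outliers, so the number of trimmed observations is determined by the data rather than fixed in advance.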
