Outliers detection with the minimum covariance determinant estimator in practice

Abstract Robust statistics have slowly become familiar to all practitioners. Books entirely devoted to the subject (e.g. [R.A. Maronna, R.D. Martin, V.J. Yohai, Robust Statistics: Theory and Methods. John Wiley & Sons, New York, NY, USA, 2006; P.J. Rousseeuw, A.M. Leroy, Robust Regression and Outlier Detection, John Wiley & Sons, New York, NY, USA, 1987], …) are without any doubt responsible for the increased practice of robust statistics in all fields of application. Even classical books often have at least one chapter (or parts of chapters) that develops robust methodology. The improvement of computing power has also contributed to the development of an ever wider range of available robust procedures. However, this success story now threatens to backfire: non-specialists interested in applying robust methodology are faced with a large set of (supposedly equivalent) methods and with the over-sophistication of some of them. Which method should one use? How should the (numerous) parameters be optimally tuned? These questions are not easy for non-specialists to answer. One could argue that default procedures are available in most statistical software (Splus, R, SAS, Matlab, …). However, using the detection of outliers in multivariate data as an illustration, it is shown that, on the one hand, it is not obvious that one would feel confident with the output of default procedures and that, on the other hand, trying to understand thoroughly the tuning parameters involved in the procedures might require extensive research. This is hardly acceptable when trying to compete with the classical methodology, which, while clearly unreliable, is so straightforward. The aim of the paper is to help practitioners who wish to detect outliers in a multivariate data set in a reliable way. The chosen methodology is the Minimum Covariance Determinant estimator, which is widely available and intuitively appealing.
