An Extension of a Method of Hardin and Rocke, with an Application to Multivariate Outlier Detection via the IRMCD Method of Cerioli

Hardin and Rocke investigated the distribution of the robust Mahalanobis squared distance (RSD) computed using the minimum covariance determinant (MCD) estimator. They showed that the distribution of RSDs for outlying observations not part of the MCD subset is well-approximated by an F distribution. They developed a methodology to adjust an asymptotic formula for the degrees of freedom parameters of this F distribution to provide correct parameter values in small-to-moderate samples. This methodology was developed for the maximum breakdown point version of the MCD, which is based on approximately half of the observations. Whether the approximation remains accurate for the MCD using larger subsets of the data is an open question. We show that their approximation works quite well for the more general MCD, but can be noticeably inaccurate for sample sizes less than 250 and when the MCD estimate uses nearly all of the observations. Motivated by the desire to apply RSD-based outlier detection tests to financial asset return and factor exposure data sets whose typical sample sizes are smaller than 250, we develop a more general correction procedure that is accurate across a wider range of sample sizes and MCD subset sizes than the Hardin and Rocke approach. We use our approach to extend Cerioli’s IRMCD procedure for accurate RSD-based outlier tests to arbitrary MCD subset sizes.
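For orientation, the quantities discussed above can be written out as follows. This is a sketch in notation that does not appear in the abstract: the symbols $n$, $p$, $h$, $m$, $c$, $\hat{\mu}_{\mathrm{MCD}}$ and $\hat{\Sigma}_{\mathrm{MCD}}$ are assumptions introduced here, and the form of the F approximation is the one commonly attributed to Hardin and Rocke rather than a statement taken from this text.

```latex
% Assumed notation: n observations x_i in R^p; the MCD subset has size h.
% The MCD location and scatter are the mean and (rescaled) covariance of the
% h-subset whose sample covariance matrix has the smallest determinant.
\[
  d_i^2 \;=\; (x_i - \hat{\mu}_{\mathrm{MCD}})^{\top}\,
              \hat{\Sigma}_{\mathrm{MCD}}^{-1}\,
              (x_i - \hat{\mu}_{\mathrm{MCD}}),
  \qquad i = 1, \dots, n .
\]
% Maximum breakdown point choice of the subset size (about half the data):
\[
  h \;=\; \left\lfloor \frac{n + p + 1}{2} \right\rfloor .
\]
% F approximation for observations outside the MCD subset, where m is the
% degrees-of-freedom parameter being corrected and c is a consistency factor:
\[
  \frac{c\,(m - p + 1)}{p\,m}\; d_i^2 \;\overset{\text{approx.}}{\sim}\; F_{p,\; m - p + 1} .
\]
```

In this notation, the question raised in the abstract is how well the adjusted value of $m$ holds up when $h$ is larger than the maximum breakdown point choice, particularly when $n$ is below 250 or $h$ is close to $n$.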

[1] P. Rousseeuw. Multivariate estimation with high breakdown point, 1985.

[2] R. Serfling. Approximation Theorems of Mathematical Statistics, 1980.

[3] David M. Rocke, et al. The Distribution of Robust Distances, 2005.

[4] P. Rousseeuw, et al. A fast algorithm for the minimum covariance determinant estimator, 1999.

[5] A. Guillou, et al. Robust and asymptotically unbiased estimation of extreme quantiles for heavy tailed distributions, 2014.

[6] B. Ripley, et al. Robust Statistics, 2018, Encyclopedia of Mathematical Geosciences.

[7] J. A. Cuesta-Albertos, et al. Trimming and likelihood: Robust location and dispersion estimation in the elliptical model, 2008, arXiv:0811.0503.

[8] David L. Woodruff, et al. Identification of Outliers in Multivariate Data, 1996.

[9] J. L. Warner, et al. Methods for Assessing Multivariate Normality, 1973.

[10] Ursula Gather, et al. The Masking Breakdown Point of Multivariate Outlier Identification Rules, 1999.

[11] G. Seber. Multivariate Observations, 1983.

[12] Peter J. Rousseeuw, et al. Robust Distances: Simulations and Cutoff Values, 1991.

[13] Ursula Gather, et al. The largest nonidentifiable outlier: a comparison of multivariate simultaneous outlier identification rules, 2001.

[14] C. Croux, et al. Influence Function and Efficiency of the Minimum Covariance Determinant Scatter Matrix Estimator, 1999.

[15] S. J. Devlin, et al. Robust Estimation of Dispersion Matrices and Principal Components, 1981.

[16] Z. Šidák. Rectangular Confidence Regions for the Means of Multivariate Normal Distributions, 1967.

[17] R. Gnanadesikan, et al. Robust Estimates, Residuals, and Outlier Detection with Multiresponse Data, 1972.

[18] P. Mahalanobis. On the generalized distance in statistics, 1936.

[19] Stefan Van Aelst, et al. Propagation of outliers in multivariate data, 2009, arXiv:0903.0447.

[20] P. Embrechts, et al. Extremes and Robustness: A Contradiction?, 2006.

[21] Andrea Cerioli, et al. Multivariate Outlier Detection With High-Breakdown Estimators, 2010.

[22] Victor J. Yohai, et al. Composite Robust Estimators for Linear Mixed Models, 2014, arXiv:1407.2176.

[23] Anthony C. Atkinson, et al. Controlling the size of multivariate outlier tests with the MCD estimator of scatter, 2009, Stat. Comput.

[24] Mia Hubert, et al. A Robust Estimator of the Tail Index Based on an Exponential Regression Model, 2004.

[25] E. S. Pearson, et al. The Efficiency of Statistical Tools and a Criterion for the Rejection of Outlying Observations, 1936.

[26] Douglas M. Hawkins, et al. Improved Feasible Solution Algorithms for High Breakdown Estimation, 1999.

[27] H. P. Lopuhaä. Asymptotics of Reweighted Estimators of Multivariate Location and Scatter, 1999.

[28] Francisco J. Prieto, et al. Multivariate Outlier Detection and Robust Covariance Matrix Estimation, 2001, Technometrics.

[29] David M. Rocke. Robustness properties of S-estimators of multivariate location and shape in high dimension, 1996.

[30] Andrew Clark, et al. Robust Portfolio Construction, 2010.

[31] Ruben H. Zamar, et al. Robust Estimates of Location and Dispersion for High-Dimensional Datasets, 2002, Technometrics.

[32] P. Rousseeuw, et al. Unmasking Multivariate Outliers and Leverage Points, 1990.

[33] N. L. Johnson, et al. Multivariate Analysis, 1958, Nature.

[34] A. Atkinson, et al. Finding an unknown number of multivariate outliers, 2009.