Diagnosing Multivariate Outliers Detected by Robust Estimators

We propose several diagnostic methods for use when robust estimates of multivariate location and scatter identify multiple outliers. Their main purpose is to visualize the multivariate data to help determine whether the detected outliers (a) form separate clusters or (b) are isolated or randomly scattered, as happens when the distribution has heavier tails than the Gaussian. We use Mahalanobis distances and linear projections to check for separation and to reveal additional aspects of the data structure. Several real data examples are analyzed, and artificial examples illustrate the diagnostic power of the proposed plots. Code to perform the diagnostics, the datasets used as examples in the article, and documentation are available in the online supplements.
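The general idea of combining Mahalanobis distances with a linear projection can be sketched as follows. This is a minimal illustration, not the authors' diagnostics or their supplementary code: it is restricted to two dimensions, it assumes the location `center` and the `scatter` matrix have already been produced by some robust estimator, and the toy data, the 0.975 cutoff level, the function names, and the choice of projection direction are all assumptions made for the example.

```python
import math

def mahalanobis_sq(x, center, scatter):
    """Squared Mahalanobis distance of a 2-D point under a 2x2 scatter matrix."""
    (a, b), (c, d) = scatter
    det = a * d - b * c
    if det <= 0:
        raise ValueError("scatter matrix must be positive definite")
    dx, dy = x[0] - center[0], x[1] - center[1]
    # The inverse of [[a, b], [c, d]] is (1/det) * [[d, -b], [-c, a]].
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det

def chi2_quantile_2df(q):
    """Chi-square quantile for 2 degrees of freedom (closed form)."""
    return -2.0 * math.log(1.0 - q)

def flag_outliers(points, center, scatter, q=0.975):
    """Flag points whose squared distance exceeds the chi-square cutoff."""
    cutoff = chi2_quantile_2df(q)
    return [mahalanobis_sq(p, center, scatter) > cutoff for p in points]

def project(points, direction):
    """Project 2-D points onto the unit vector along `direction`."""
    norm = math.hypot(direction[0], direction[1])
    u = (direction[0] / norm, direction[1] / norm)
    return [p[0] * u[0] + p[1] * u[1] for p in points]

# Toy data (assumed): five bulk points and three points far from them.
points = [(0.5, -0.3), (-1.0, 0.8), (0.2, 1.1), (1.2, 0.4), (-0.7, -0.9),
          (6.0, 6.2), (5.8, 6.1), (6.3, 5.9)]
center = (0.0, 0.0)                  # assumed output of a robust location estimate
scatter = ((1.0, 0.0), (0.0, 1.0))   # assumed output of a robust scatter estimate

flags = flag_outliers(points, center, scatter)

# Project onto the direction from the center to the mean of the flagged points.
flagged = [p for p, f in zip(points, flags) if f]
mean_out = (sum(p[0] for p in flagged) / len(flagged),
            sum(p[1] for p in flagged) / len(flagged))
direction = (mean_out[0] - center[0], mean_out[1] - center[1])
proj = project(points, direction)
bulk_proj = [v for v, f in zip(proj, flags) if not f]
out_proj = [v for v, f in zip(proj, flags) if f]
```

In this toy case the flagged points all project well beyond the bulk along the chosen direction, which is the pattern one would expect from a separate cluster; isolated or heavy-tailed outliers would instead tend to spread out with no clear gap.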
