Outlier identification in high dimensions

A computationally fast procedure for identifying outliers is presented that is particularly effective in high dimensions. The algorithm uses simple properties of principal components to identify outliers in the transformed space, which yields significant computational advantages for high-dimensional data. It requires considerably less computation time than existing outlier detection methods and is suitable for very large data sets. It can also handle the setting, common in certain biological applications, in which the number of dimensions is several orders of magnitude larger than the number of observations. The performance of the method is illustrated on real and simulated data with dimensions in the thousands.
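The general idea of flagging outliers in principal component space can be sketched as follows. This is a minimal illustration, not the procedure presented in the paper: the function name `pca_outlier_flags`, the median/MAD scaling, the fixed number of retained components (`n_components`), and the cutoff of 3.5 on a robust z-score are all assumptions made for the example. It relies on the thin SVD, whose cost grows with the square of the smaller data dimension, so it stays cheap when the number of variables far exceeds the number of observations.

```python
import numpy as np


def _robust_scale(a):
    """Center by the median and scale by the MAD along axis 0."""
    med = np.median(a, axis=0)
    # 1.4826 makes the MAD consistent with the standard deviation for normal data
    mad = 1.4826 * np.median(np.abs(a - med), axis=0)
    mad = np.where(mad > 0, mad, 1.0)  # guard against constant columns
    return (a - med) / mad


def pca_outlier_flags(X, n_components=10, cutoff=3.5):
    """Flag rows of X that lie far from the bulk in principal component space.

    Illustrative sketch only; n_components and cutoff are assumed defaults.
    Returns (flags, dist): a boolean array and the per-row distances.
    """
    X = np.asarray(X, dtype=float)

    # Step 1: robustly standardize each variable.
    Z = _robust_scale(X)

    # Step 2: PCA via the thin SVD of the mean-centered data.
    Zc = Z - Z.mean(axis=0)
    U, s, _ = np.linalg.svd(Zc, full_matrices=False)

    # Keep at most n_components directions with non-negligible variance.
    rank = int(np.sum(s > 1e-12 * max(s[0], 1.0)))
    k = min(n_components, rank)
    scores = U[:, :k] * s[:k]  # coordinates of each row in PC space

    # Step 3: robustly rescale each component, then measure distance.
    scores = _robust_scale(scores)
    dist = np.sqrt((scores ** 2).sum(axis=1))

    # Step 4: flag rows whose distance is extreme relative to the bulk.
    flags = _robust_scale(dist) > cutoff
    return flags, dist
```

For example, calling `pca_outlier_flags(X)` on an array `X` with 60 rows and 2000 columns returns a boolean vector of length 60 together with the distances used to compute it. Because the cutoff is a heuristic, the distances are worth inspecting directly rather than relying on the flags alone.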
