Distance-based outlier detection for high dimension, low sample size data

ABSTRACT Despite the popularity of high dimension, low sample size data analysis, little attention has been paid to sample integrity, in particular to the possibility of outliers in the data. We present a new outlier detection procedure for data whose dimensionality is much larger than the sample size. The proposed method is motivated by asymptotic properties of high-dimensional distance measures. Empirical studies suggest that high-dimensional outlier detection is more likely to suffer from a swamping effect than from a masking effect, and thus yields more false positives than false negatives. We compare the proposed approaches with existing methods using simulated data from various population settings. A real data example is presented, with a discussion of the implications of the detected outliers.
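To make the idea of a distance-based check concrete, the following is a minimal sketch and not the procedure proposed in the paper. It assumes standard normal data, one planted outlier shifted in every coordinate, a leave-one-out distance to the centroid of the remaining points, and a median-plus-scaled-MAD cutoff. The function names, the cutoff multiplier `k`, and the simulation settings are all hypothetical choices made for illustration.

```python
import math
import random
import statistics


def simulate(n=20, d=2000, shift=1.0, seed=0):
    """Draw n standard normal points in d dimensions (assumed setting)
    and shift the last point in every coordinate to plant one outlier."""
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    data[-1] = [x + shift for x in data[-1]]
    return data


def leave_one_out_distances(data):
    """Euclidean distance from each point to the centroid of the others."""
    n, d = len(data), len(data[0])
    totals = [sum(row[j] for row in data) for j in range(d)]
    dists = []
    for row in data:
        sq = 0.0
        for j in range(d):
            # Centroid of all points except the current one.
            centroid_j = (totals[j] - row[j]) / (n - 1)
            sq += (row[j] - centroid_j) ** 2
        dists.append(math.sqrt(sq))
    return dists


def flag_outliers(dists, k=5.0):
    """Flag indices whose distance exceeds median + k * 1.4826 * MAD.
    The multiplier k is an illustrative assumption, not a tuned value."""
    med = statistics.median(dists)
    mad = statistics.median([abs(x - med) for x in dists])
    cutoff = med + k * 1.4826 * mad
    return [i for i, x in enumerate(dists) if x > cutoff]


if __name__ == "__main__":
    data = simulate()
    dists = leave_one_out_distances(data)
    print("flagged indices:", flag_outliers(dists))
```

With so few samples the MAD is itself a noisy estimate, so a cutoff of this kind can flag ordinary points along with the planted one. That is the swamping behaviour (false positives) the abstract describes.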
