Sparse Principal Component Analysis Based on Least Trimmed Squares

ABSTRACT Sparse principal component analysis (PCA) is used to obtain stable and interpretable principal components (PCs) from high-dimensional data. A robust sparse PCA method is proposed to handle potential outliers in the data. The proposed method builds on the least trimmed squares PCA method, which provides robust but non-sparse PC estimates. To obtain sparse solutions, our method incorporates a regularization penalty on the loading vectors. The principal directions are determined sequentially to prevent outliers within the PC subspace from destroying the sparse structure of the loadings. Simulation studies and real data examples show that the new method gives accurate estimates even when the data are highly contaminated. Moreover, compared with existing robust sparse PCA methods, the computation time is greatly reduced. Supplementary materials providing further simulation results and discussion, and an R package implementing the proposed method, are available online.
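The two ingredients the abstract combines, trimming of poorly fitted observations (least trimmed squares) and a sparsity penalty on the loading vector, can be illustrated with a minimal sketch. This is a hypothetical simplification written for exposition, not the paper's actual algorithm or its R implementation: it estimates a single principal direction by alternating a trimming step with a soft-thresholded power-method update on the retained observations. The function name `sparse_lts_pc`, the penalty parameter `lam`, and the trimming fraction `alpha` are illustrative choices.

```python
import numpy as np

def soft_threshold(v, lam):
    """Elementwise soft-thresholding, which induces sparsity."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_lts_pc(X, lam=0.1, alpha=0.75, n_iter=50, seed=0):
    """Sketch of one robust sparse principal direction.

    Alternates between (1) keeping the h = alpha * n observations best
    explained by the current direction (the trimming idea behind least
    trimmed squares) and (2) a soft-thresholded power step on the
    trimmed subset (the sparsity penalty on the loading vector).
    """
    n, p = X.shape
    h = int(np.ceil(alpha * n))  # size of the retained (clean) subset
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(p)
    a /= np.linalg.norm(a)
    for _ in range(n_iter):
        # Orthogonal-distance residuals of each observation to the
        # current one-dimensional subspace spanned by a.
        scores = X @ a
        resid = np.sum((X - np.outer(scores, a)) ** 2, axis=1)
        keep = np.argsort(resid)[:h]          # h best-fitting observations
        Xh = X[keep] - X[keep].mean(axis=0)   # recenter on the clean subset
        # Power-method step with an L1-type shrinkage on the loadings.
        a_new = soft_threshold(Xh.T @ (Xh @ a), lam)
        norm = np.linalg.norm(a_new)
        if norm == 0.0:  # penalty shrank everything to zero
            break
        a = a_new / norm
    return a
```

Further components would be extracted sequentially, deflating the data after each direction, which is the strategy the abstract credits with keeping outliers in the PC subspace from corrupting the sparse loading structure.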
