Selecting the number of components in principal component analysis using cross-validation approximations

Cross-validation is a tried-and-tested approach for selecting the number of components in principal component analysis (PCA); its main drawback, however, is its computational cost. In a regression (or nonparametric regression) setting, criteria such as generalized cross-validation (GCV) provide convenient approximations to leave-one-out cross-validation. They are based on the relation between the prediction error and the residual sum of squares weighted by elements of a projection matrix (or a smoothing matrix). Such a relation is established here for PCA, using an original presentation of PCA in terms of a single projection matrix. This makes it possible to define two cross-validation approximation criteria: the smoothing approximation of the cross-validation criterion (SACV) and the GCV criterion. The methods are assessed in simulations and give promising results.
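The abstract does not reproduce the paper's projection-matrix formulation, so the following is only a minimal sketch of a GCV-style selector under stated assumptions: the rank-k fit is taken to be the truncated SVD of the centred data, and the trace of the projection matrix is replaced by k(n + p − k), the dimension of the set of n×p rank-k matrices. The function name `gcv_pca` and this degrees-of-freedom plug-in are illustrative choices, not the authors' exact criterion.

```python
import numpy as np

def gcv_pca(X, k_max):
    """GCV-style selection of the number of PCA components (sketch).

    Assumes the rank-k fit is the truncated SVD of the column-centred
    data and that the model's degrees of freedom are k*(n + p - k),
    the dimension of the set of n x p rank-k matrices. The paper's
    SACV/GCV criteria are derived from an exact projection matrix
    and may differ from this plug-in.
    """
    Xc = X - X.mean(axis=0)
    n, p = Xc.shape
    assert k_max < min(n, p), "need k < min(n, p) so np - df stays positive"
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = {}
    for k in range(1, k_max + 1):
        Xhat = (U[:, :k] * s[:k]) @ Vt[:k]   # rank-k reconstruction
        rss = np.sum((Xc - Xhat) ** 2)       # residual sum of squares
        df = k * (n + p - k)                 # plug-in degrees of freedom
        # GCV-type criterion: RSS inflated by the effective model size
        scores[k] = (n * p * rss) / (n * p - df) ** 2
    return min(scores, key=scores.get), scores

# Toy usage: rank-3 signal plus noise; the selector should recover k = 3.
rng = np.random.default_rng(0)
signal = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 20))
X = signal + 0.5 * rng.normal(size=(100, 20))
k_star, scores = gcv_pca(X, k_max=10)
```

The appeal of this family of criteria, as the abstract notes, is that they trade the explicit refitting loop of leave-one-out cross-validation for a single fit per candidate k plus a correction term built from the projection matrix.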
