Fast cross-validation of high-breakdown resampling methods for PCA

Cross-validation (CV) is a widely used technique for model selection and model validation. Leave-one-out CV (LOO-CV) excludes one observation from the data set, fits the model to the remaining observations, and evaluates that fit on the observation that was left out. For classical procedures such as least-squares regression or kernel density estimation, simple closed-form formulas give this CV fit, or the residuals of the removed observations, without refitting. High-breakdown methods are designed to yield estimates that withstand the effects of outlying observations, but when they are computed by resampling algorithms such closed-form expressions can no longer be derived. Fast algorithms are presented for LOO-CV with high-breakdown methods based on resampling, in the context of robust covariance estimation by means of the minimum covariance determinant (MCD) estimator and of robust principal component analysis. A robust PRESS (predicted residual sum of squares) curve is introduced as an exploratory tool for selecting the number of principal components. Simulation results and applications to real data show the accuracy of these fast CV algorithms and their gain in computation time.
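As an illustration of the classical closed-form case only (not the fast robust algorithms of the paper), the following sketch compares naive LOO-CV for least-squares regression with the standard leverage shortcut, in which the leave-one-out residual equals e_i / (1 - h_ii), where e_i is the ordinary residual and h_ii the i-th diagonal element of the hat matrix. The function names and the simulated data are assumptions made for this example.

```python
import numpy as np

def loo_residuals_naive(X, y):
    """Refit least squares n times, each time leaving one observation out."""
    n = X.shape[0]
    res = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        res[i] = y[i] - X[i] @ beta  # prediction error on the held-out point
    return res

def loo_residuals_fast(X, y):
    """Closed-form LOO residuals from a single fit: e_i / (1 - h_ii)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    Q, _ = np.linalg.qr(X)        # reduced QR; hat matrix H = Q Q^T
    h = np.sum(Q**2, axis=1)      # leverages h_ii (diagonal of H)
    return resid / (1.0 - h)

# Simulated data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=30)

naive = loo_residuals_naive(X, y)
fast = loo_residuals_fast(X, y)
press = float(np.sum(fast**2))    # classical PRESS statistic
```

Both functions return the same residuals up to floating-point error, but the shortcut needs one fit instead of n. No such identity is available for resampling-based high-breakdown estimators, which is the gap the fast algorithms described above aim to fill.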
