Handling missing data in principal component analysis

A common approach to handling missing values in Principal Component Analysis (PCA) is to ignore them by minimizing the loss function over the non-missing elements only. This can be achieved by several methods, including NIPALS, weighted regression and iterative PCA. The latter iteratively imputes the missing elements while estimating the parameters, and can be seen as a particular EM algorithm. First, we review these approaches in terms of the criterion they minimize. This presentation clarifies their properties and the difficulties they encounter. Then, we point out the problem of overfitting and show how the probabilistic formulation of PCA (Tipping & Bishop, 1997) offers a proper and convenient regularization term to overcome it. Finally, the performance of the new algorithm is compared with that of the other algorithms on simulations.
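The iterative PCA idea mentioned above (impute the missing cells, fit a low-rank PCA, re-impute from the fit, repeat) can be sketched as follows. This is a minimal, unregularized illustration and not the regularized algorithm proposed here; the function name `iterative_pca_impute`, its parameters, the column-mean initialization and the stopping rule are all assumptions made for the example.

```python
import numpy as np


def iterative_pca_impute(X, n_components, max_iter=500, tol=1e-8):
    """Hypothetical sketch of unregularized iterative PCA imputation.

    X is a 2-D array with np.nan marking missing cells. Returns the
    completed matrix and the final rank-`n_components` fitted matrix.
    Assumes every column has at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Assumed initialization: fill each missing cell with its column mean.
    col_means = np.nanmean(X, axis=0)
    completed = np.where(missing, col_means, X)

    prev_loss = np.inf
    fitted = completed
    for _ in range(max_iter):
        # PCA of the current completed matrix via a truncated SVD.
        mean = completed.mean(axis=0)
        U, s, Vt = np.linalg.svd(completed - mean, full_matrices=False)
        k = n_components
        fitted = (U[:, :k] * s[:k]) @ Vt[:k] + mean

        # Least-squares criterion evaluated on the observed cells only.
        loss = np.sum((X - fitted)[~missing] ** 2)

        # Imputation step: keep observed values, replace missing ones
        # by the values reconstructed from the low-rank fit.
        completed = np.where(missing, fitted, X)

        # Stop when the criterion no longer decreases noticeably.
        if prev_loss - loss < tol * max(prev_loss, 1.0):
            break
        prev_loss = loss

    return completed, fitted
```

Because nothing penalizes the fit, a sketch like this can overfit when many values are missing or the data are noisy, which is the problem the regularization term discussed above is meant to address.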

[1] G. G. Stokes. "J.", 1890, The New Yale Book of Quotations.

[2] Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space, 1901.

[3] M. Healy, et al. Missing Values in Experiments Analysed on Automatic Computers, 1956.

[4] Y. Escoufier. Le traitement des variables vectorielles, 1973.

[5] Miss A.O. Penney. (b), 1974, The New Yale Book of Quotations.

[6] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper, 1977.

[7] S. Zamir, et al. Lower Rank Approximation of Matrices by Least Squares With Any Choice of Weights, 1979.

[8] J. C. van Houwelingen, et al. An Application of Factor Analysis With Missing Data, 1981.

[9] Dorothy T. Thayer, et al. EM algorithms for ML factor analysis, 1982.

[10] Gene H. Golub, et al. Matrix computations, 1983.

[11] R. Clarke, et al. Theory and Applications of Correspondence Analysis, 1985.

[12] J. B. Denis, et al. Ajustements de modèles linéaires et bilinéaires sous contraintes linéaires avec données manquantes, 1991.

[13] Darren T. Andrews, et al. Maximum likelihood principal component analysis, 1997.

[14] Sam T. Roweis, et al. EM Algorithms for PCA and Sensible PCA, 1997, NIPS 1997.

[15] H. Kiers. Weighted least squares fitting using ordinary least squares algorithms, 1997.

[16] Sam T. Roweis, et al. EM Algorithms for PCA and SPCA, 1997, NIPS.

[17] R. Manne, et al. Missing values in principal component analysis, 1998.

[18] Rasmus Bro, et al. Multi-way Analysis in the Food Industry: Models, Algorithms & Applications, 1998.

[19] Christopher M. Bishop, et al. Mixtures of Probabilistic Principal Component Analyzers, 1999, Neural Computation.

[20] Michael E. Tipping, et al. Probabilistic Principal Component Analysis, 1999.

[21] D. Massart, et al. Dealing with missing data: Part II, 2001.

[22] John F. Canny, et al. Collaborative filtering with privacy via factor analysis, 2002, SIGIR '02.

[23] J. Schafer, et al. Missing data: our view of the state of the art, 2002, Psychological Methods.

[24] Nathan Srebro, et al. Learning with matrix factorizations, 2004.

[25] Juha Karhunen, et al. Principal Component Analysis for Sparse High-Dimensional Data, 2007, ICONIP.

[26] Benjamin M. Marlin, et al. Missing Data Problems in Machine Learning, 2008.

[27] Jérôme Pagès, et al. Testing the significance of the RV coefficient, 2008, Comput. Stat. Data Anal.

[28] Robert Tibshirani, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition, 2001, Springer Series in Statistics.