Multiple imputation in principal component analysis

The available methods for handling missing values in principal component analysis provide only point estimates of the parameters (axes and components) and of the missing values. To take into account the variability due to missing values, a multiple imputation method is proposed. First, a method to generate multiple imputed data sets from a principal component analysis model is defined. Then, two ways to visualize the uncertainty due to missing values on the principal component analysis results are described. The first consists of projecting the imputed data sets onto a reference configuration as supplementary elements, to assess the stability of the individuals (respectively, of the variables). The second consists of performing a principal component analysis on each imputed data set and fitting each resulting configuration onto the reference configuration by Procrustes rotation. The latter strategy makes it possible to assess the variability of the principal component analysis parameters induced by the missing values. The methodology is then evaluated on a real data set.
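The two visualization strategies described above can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the procedure of the paper. The imputed data sets here are generated by a simplified stand-in: a low-rank PCA fit plus Gaussian noise whose scale is estimated from the residuals on observed entries. This ignores the uncertainty of the PCA parameters themselves. All function names (`iterative_pca_impute`, `multiple_impute`, `pca_scores`, `project_supplementary`, `procrustes_rotate`) and all default values are hypothetical.

```python
import numpy as np


def iterative_pca_impute(X, n_components=2, n_iter=200, tol=1e-8):
    """Fill NaN entries with a rank-k PCA fit, iterating until the fill stabilizes."""
    mask = np.isnan(X)
    # Start from column means computed on the observed entries.
    filled = np.where(mask, np.nanmean(X, axis=0), X)
    fitted = filled
    for _ in range(n_iter):
        mean = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        fitted = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mean
        updated = np.where(mask, fitted, X)
        converged = np.sum((updated - filled) ** 2) < tol
        filled = updated
        if converged:
            break
    return filled, fitted


def multiple_impute(X, n_components=2, n_draws=20, seed=0):
    """Simplified stand-in (assumption): low-rank fit plus Gaussian residual noise."""
    rng = np.random.default_rng(seed)
    mask = np.isnan(X)
    filled, fitted = iterative_pca_impute(X, n_components)
    # Residual scale estimated on observed cells only.
    sigma = np.std((X - fitted)[~mask])
    draws = []
    for _ in range(n_draws):
        noisy = fitted + rng.normal(0.0, sigma, size=X.shape)
        # Observed cells are kept as they are; only missing cells vary across draws.
        draws.append(np.where(mask, noisy, X))
    return filled, draws


def pca_scores(X, n_components=2):
    """Return (scores, axes) of a centered PCA computed by SVD."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components], Vt[:n_components].T


def project_supplementary(X_imputed, ref_mean, ref_axes):
    """Strategy 1: project an imputed data set onto the reference axes."""
    return (X_imputed - ref_mean) @ ref_axes


def procrustes_rotate(A, B):
    """Strategy 2: orthogonal Procrustes fit of configuration A onto B."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return A @ (U @ Vt)
```

In this sketch, the reference configuration would be `pca_scores(filled)`. The spread of the supplementary projections of each draw around a reference point indicates how stable that individual is, while the spread of the Procrustes-rotated per-draw scores reflects the variability of the whole configuration.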
