Multiple imputation for continuous variables using a Bayesian principal component analysis

ABSTRACT We propose a multiple imputation method based on principal component analysis (PCA) to deal with incomplete continuous data. To reflect the uncertainty of the parameters from one imputation to the next, we use a Bayesian treatment of the PCA model. Using a simulation study and real data sets, we compare the method with two classical approaches: multiple imputation based on joint modelling and on fully conditional modelling. Unlike these approaches, the proposed method can easily be used on data sets where the number of individuals is smaller than the number of variables and where the variables are highly correlated. In addition, it provides unbiased point estimates of quantities of interest, such as an expectation, a regression coefficient or a correlation coefficient, with a smaller mean squared error. Furthermore, the confidence intervals built for the quantities of interest are often narrower while maintaining valid coverage.
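For readers unfamiliar with how point estimates and confidence intervals are obtained after multiple imputation, the pooling step can be sketched as follows. This is a minimal illustration of the standard combining rules (Rubin's rules) with the classical large-sample degrees of freedom. It is not the Bayesian PCA imputation procedure itself, and the function name `pool_rubin` and its arguments are hypothetical, chosen only for this example.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool M per-imputation estimates and their squared standard errors.

    `estimates` and `variances` are hypothetical inputs: one point estimate
    and one estimated variance per imputed data set.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")

    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # within-imputation variance
    between = variance(estimates)    # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between

    # Classical large-sample degrees of freedom; infinite if the
    # imputations all give the same estimate (no between-imputation variance).
    if between == 0:
        df = float("inf")
    else:
        df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    return q_bar, total, df


# Toy numbers for illustration only (not results from the paper).
q_bar, total, df = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

A confidence interval for the quantity of interest would then typically be built as `q_bar` plus or minus a Student t quantile with `df` degrees of freedom times the square root of `total`.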
