Assessment of maximum likelihood PCA missing data imputation

Maximum likelihood principal component analysis (MLPCA) was originally proposed to incorporate measurement error variance information in principal component analysis (PCA) models. MLPCA can also be used to fit PCA models in the presence of missing data, simply by assigning very large variances to the non‐measured values. This paper assesses maximum likelihood missing data imputation by analysing the MLPCA algorithm and adapting several methods for PCA model building with missing data to their maximum likelihood versions. Specifically, known data regression (KDR), KDR with principal component regression (PCR), KDR with partial least squares regression (PLS) and trimmed scores regression (TSR) are implemented within MLPCA as alternative imputation steps. Six data sets are analysed at several percentages of missing data, comparing the performance of the original algorithm and its adapted regression‐based methods with other state‐of‐the‐art methods. Copyright © 2016 John Wiley & Sons, Ltd.
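The idea of treating missing entries as measurements with very large variance can be illustrated with a minimal Python sketch. It is not the algorithm evaluated in the paper: it assumes uncorrelated measurement errors and replaces the MLPCA iterations with a simple alternating weighted least squares fit of a low-rank model. The function name `mlpca_impute`, the constant `BIG_VARIANCE`, the zero placeholder for missing cells and the omission of mean centering are all assumptions made for illustration.

```python
import numpy as np

BIG_VARIANCE = 1e10  # assumed "very large" variance given to missing entries


def mlpca_impute(X, sd, n_components, n_iter=500):
    """Sketch: weighted low-rank fit in which NaN entries get a huge variance.

    Minimises sum_ij (x_ij - xhat_ij)^2 / var_ij over rank-`n_components`
    matrices xhat by alternating weighted least squares (an illustrative
    stand-in for MLPCA with uncorrelated errors; no mean centering).
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    X0 = np.where(missing, 0.0, X)            # placeholder values, almost ignored
    var = np.where(missing, BIG_VARIANCE, np.asarray(sd, dtype=float) ** 2)
    W = 1.0 / var                             # weights = inverse error variances

    # Initial loadings from the SVD of the zero-filled matrix.
    _, _, Vt = np.linalg.svd(X0, full_matrices=False)
    P = Vt[:n_components].T                   # shape (n_vars, n_components)

    for _ in range(n_iter):
        # Scores: one weighted regression per row, loadings held fixed.
        A = np.einsum('ja,ij,jb->iab', P, W, P)
        b = np.einsum('ja,ij,ij->ia', P, W, X0)
        T = np.linalg.solve(A, b[..., None])[..., 0]
        # Loadings: one weighted regression per column, scores held fixed.
        A = np.einsum('ia,ij,ib->jab', T, W, T)
        b = np.einsum('ia,ij,ij->ja', T, W, X0)
        P = np.linalg.solve(A, b[..., None])[..., 0]

    X_hat = T @ P.T                           # low-rank reconstruction
    X_imputed = np.where(missing, X_hat, X)   # keep measured values untouched
    return X_imputed, X_hat


# Synthetic example: noise-free rank-2 data with ten entries removed.
rng = np.random.default_rng(0)
X_true = rng.normal(size=(30, 2)) @ rng.normal(size=(6, 2)).T
X_obs = X_true.copy()
for k, i in enumerate(range(0, 30, 3)):
    X_obs[i, k % 6] = np.nan                  # at most one missing value per row

X_imputed, X_hat = mlpca_impute(X_obs, sd=1.0, n_components=2)
max_error = np.max(np.abs(X_imputed - X_true))
```

Because the missing cells receive a weight of 1/`BIG_VARIANCE`, their placeholder values have almost no influence on the fitted scores and loadings, and the imputed values are simply read off the low-rank reconstruction. The regression-based variants named in the abstract (KDR, PCR, PLS, TSR) would replace this imputation step and are not shown here.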
