Missing data imputation via the expectation-maximization algorithm can improve principal component analysis aimed at deriving biomarker profiles and dietary patterns.

Principal component analysis (PCA) is a popular statistical tool. However, imputing missing data before PCA, a good practice with numerous advantages, is not common. In the present work, we evaluated the hypothesis that the expectation-maximization (EM) algorithm for missing data imputation is a reliable and advantageous procedure when using PCA to derive biomarker profiles and dietary patterns. To this end, we used numerical simulations designed to mimic real data commonly observed in nutritional research. Finally, we showed the advantages and pitfalls of the EM algorithm for missing data imputation applied to plasma fatty acid concentrations and nutrient intakes from real data sets derived from the US National Health and Nutrition Examination Survey. PCA applied to simulated data with missing values resulted in biased eigenvalues with respect to the original data set without missing values. The bias between the eigenvalues from the original data set and those from the data set with missing values increased with the number of missing values and appeared to be independent of the correlation structure among variables. On the other hand, when data were imputed, the mean of the eigenvalues over the 10 missing data imputation runs overlapped with those derived from the PCA applied to the original data set. These results were confirmed when real data sets from the National Health and Nutrition Examination Survey were analyzed. We accept the hypothesis that the EM algorithm for missing data imputation, applied before PCA aimed at deriving biochemical profiles and dietary patterns, is an effective technique, especially for relatively small sample sizes.
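The comparison described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes multivariate normal data, values missing completely at random, and a single deterministic EM fit rather than the 10 imputation runs reported in the abstract. The function names (`em_impute`, `correlation_eigenvalues`), the simulated data, and the 20% missingness rate are all hypothetical choices made for illustration.

```python
import numpy as np


def em_impute(X, n_iter=100, tol=1e-6):
    """EM for a multivariate normal with missing entries (NaN).

    Illustrative sketch only. Returns the imputed matrix together with
    the estimated mean vector and covariance matrix.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)

    # Start from mean imputation.
    mu = np.nanmean(X, axis=0)
    filled = np.where(miss, mu, X)
    sigma = np.cov(filled, rowvar=False, bias=True)

    for _ in range(n_iter):
        cond_cov = np.zeros((p, p))
        # E-step: conditional mean and covariance of the missing entries.
        for i in np.where(miss.any(axis=1))[0]:
            m, o = miss[i], ~miss[i]
            if not o.any():  # row with nothing observed
                filled[i] = mu
                cond_cov += sigma
                continue
            # coef = S_mo @ inv(S_oo), computed with a linear solve
            coef = np.linalg.solve(sigma[np.ix_(o, o)], sigma[np.ix_(o, m)]).T
            filled[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
            cond_cov[np.ix_(m, m)] += sigma[np.ix_(m, m)] - coef @ sigma[np.ix_(o, m)]

        # M-step: update mean and covariance from the completed data.
        new_mu = filled.mean(axis=0)
        centered = filled - new_mu
        new_sigma = (centered.T @ centered + cond_cov) / n

        converged = np.max(np.abs(new_sigma - sigma)) < tol
        mu, sigma = new_mu, new_sigma
        if converged:
            break

    return filled, mu, sigma


def correlation_eigenvalues(cov):
    """Eigenvalues of the correlation matrix implied by `cov`, largest first."""
    sd = np.sqrt(np.diag(cov))
    corr = cov / np.outer(sd, sd)
    return np.linalg.eigvalsh(corr)[::-1]


# Simulated example: five variables sharing one latent factor (hypothetical data).
rng = np.random.default_rng(0)
n, p = 200, 5
latent = rng.normal(size=(n, 1))
X_full = 0.7 * latent + rng.normal(size=(n, p))

# Remove about 20% of the values completely at random.
mask = rng.random((n, p)) < 0.2
X_miss = np.where(mask, np.nan, X_full)

filled, mu_hat, sigma_hat = em_impute(X_miss)

eig_full = correlation_eigenvalues(np.cov(X_full, rowvar=False))
eig_mean = correlation_eigenvalues(
    np.cov(np.where(mask, np.nanmean(X_miss, axis=0), X_miss), rowvar=False)
)
eig_em = correlation_eigenvalues(sigma_hat)

print("complete data  :", np.round(eig_full, 3))
print("mean imputation:", np.round(eig_mean, 3))
print("EM estimate    :", np.round(eig_em, 3))
```

The eigenvalues here are taken from the covariance matrix estimated by EM, which includes the conditional covariance of the missing entries; running PCA directly on the filled-in matrix would understate the variance of the imputed variables. Repeating the procedure over several random missingness patterns and averaging the eigenvalues would be closer to the simulation design summarized in the abstract.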
