Handling missing rows in multi-omics data integration: multiple imputation in multiple factor analysis framework

Background
In omics data integration studies, it is common, for a variety of reasons, for some individuals to be absent from one or more data tables. Missing rows are challenging to deal with because most statistical methods cannot be applied directly to incomplete datasets. To overcome this issue, we propose a multiple imputation (MI) approach in a multivariate framework. In this study, we focus on multiple factor analysis (MFA) as a tool to compare and integrate multiple layers of information. MI involves filling the missing rows with plausible values, resulting in M completed datasets. MFA is then applied to each completed dataset to produce M different configurations (the matrices of coordinates of individuals). Finally, the M configurations are combined to yield a single consensus solution.

Results
We assessed the performance of our method, named MI-MFA, on two real omics datasets. Incomplete artificial datasets with different patterns of missingness were created from these data. The MI-MFA results were compared with those of two other approaches, namely regularized iterative MFA (RI-MFA) and mean variable imputation (MVI-MFA). For each configuration resulting from these three strategies, the suitability of the solution was assessed against the true MFA configuration obtained from the original data, and a comprehensive graphical comparison was produced showing how the MI-, RI- and MVI-MFA configurations diverge from the true configuration. Two approaches for visualizing and assessing the uncertainty due to missing values, namely confidence ellipses and convex hulls, were also described. We showed how the areas of the ellipses and convex hulls increased with the number of missing individuals. Free and easy-to-use code was provided to implement the MI-MFA method in the R statistical environment.

Conclusions
We believe that MI-MFA provides a useful and attractive method for estimating the coordinates of individuals on the first MFA components despite missing rows. MI-MFA configurations were close to the true configuration even when many individuals were missing in several data tables. The method also takes into account the uncertainty that missing rows induce in the MI-MFA configurations, thereby allowing the reliability of the results to be evaluated.
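The three steps described in the Background (impute M times, run MFA on each completed dataset, combine the M configurations) can be illustrated with a minimal Python sketch. This is not the authors' R code. Several choices here are assumptions made only for illustration: missing rows are filled by a simple hot-deck draw from the observed rows of the same table, MFA is reduced to a PCA of the concatenated tables after each standardized table is divided by its first singular value, and the consensus is the mean of the configurations after orthogonal Procrustes alignment to the first one. The function names (`hot_deck_impute`, `mfa_scores`, `procrustes_align`, `mi_mfa`) are hypothetical.

```python
import numpy as np

def mfa_scores(tables, n_comp=2):
    """Simplified MFA: standardize each table, weight it by the inverse of its
    first singular value, then run a PCA (via SVD) on the concatenated tables."""
    blocks = []
    for X in tables:
        Z = X - X.mean(axis=0)
        sd = Z.std(axis=0)
        Z = Z / np.where(sd > 0, sd, 1.0)       # leave constant columns at zero
        s1 = np.linalg.svd(Z, compute_uv=False)[0]
        blocks.append(Z / s1)                    # MFA table weighting
    U, s, _ = np.linalg.svd(np.hstack(blocks), full_matrices=False)
    return U[:, :n_comp] * s[:n_comp]            # coordinates of individuals

def hot_deck_impute(X, rng):
    """Assumed imputation step: replace each entirely missing row (all NaN)
    with a row drawn at random from the observed rows of the same table."""
    X = X.copy()
    missing = np.isnan(X).all(axis=1)
    donors = np.flatnonzero(~missing)
    X[missing] = X[rng.choice(donors, size=missing.sum())]
    return X

def procrustes_align(F, target):
    """Rotate/reflect F so that it is as close as possible to target."""
    U, _, Vt = np.linalg.svd(F.T @ target)
    return F @ (U @ Vt)

def mi_mfa(tables, M=20, n_comp=2, seed=0):
    """Return the consensus configuration and the M aligned configurations."""
    rng = np.random.default_rng(seed)
    configs = [
        mfa_scores([hot_deck_impute(X, rng) for X in tables], n_comp)
        for _ in range(M)
    ]
    # Assumed combination rule: align everything to the first configuration,
    # then average. The spread of the aligned points reflects the uncertainty.
    aligned = [configs[0]] + [procrustes_align(F, configs[0]) for F in configs[1:]]
    return np.mean(aligned, axis=0), aligned
```

In this sketch, the M aligned coordinates of a given individual form the cloud of points from which a confidence ellipse or convex hull could be drawn. Tables with partially missing rows (some but not all values absent) are not handled.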
