Prediction approaches for partly missing multi-omics covariate data: A literature review and an empirical comparison study

As omics data have become more widely available in recent years, more multi-omics data have been generated, that is, high-dimensional molecular data comprising several types, such as genomic, transcriptomic, or proteomic data, all obtained from the same patients. Such data lend themselves to use as covariates in automatic outcome prediction because each omics type may contribute unique information, possibly improving predictions compared to using only one omics data type. Frequently, however, not all omics data types are available for all patients, both in the training data and in the test data, that is, the data to which the automatic prediction rules are to be applied. We refer to this type of data as block-wise missing multi-omics data. First, we provide a literature review of existing prediction methods applicable to such data. Subsequently, using a collection of 13 publicly available multi-omics data sets, we compare the predictive performance of several of these approaches for different block-wise missingness patterns. Finally, we discuss the results of this empirical comparison study and draw some tentative conclusions.
