A Review of Integrative Imputation for Multi-Omics Datasets

Multi-omics studies, which explore the interactions between multiple types of biological factors, have significant advantages over single-omics analysis: they provide a more holistic view of biological processes, uncover the causal and functional mechanisms of complex diseases, and facilitate new discoveries in precision medicine. However, omics datasets often contain missing values, and in multi-omics study designs it is common for individuals to be represented in some omics layers but not all. Because most statistical analyses cannot be applied directly to incomplete datasets, imputation is typically performed to infer the missing values. Integrative imputation techniques, which make use of the correlations and shared information among multi-omics datasets, are expected to outperform approaches that rely on single-omics information alone, leading to more accurate downstream analyses. In this review, we provide an overview of currently available imputation methods for handling missing values in bioinformatics data, with an emphasis on multi-omics imputation. We also offer a perspective on how deep learning methods might be developed for the integrative imputation of multi-omics datasets.
