TOBMI: trans‐omics block missing data imputation using a k‐nearest neighbor weighted approach

Motivation: Stitching together trans‐omics data is a powerful approach for assessing the complex mechanisms of cancer occurrence, progression and treatment. However, the integration process suffers from the ‘block missing’ phenomenon, in which some individuals lack data for one or more omics layers.

Results: We propose a k‐nearest neighbor (kNN) weighted imputation method for trans‐omics block missing data (TOBMIkNN) to handle individuals whose gene expression data are absent from RNA‐seq datasets, using external information obtained from DNA methylation probe datasets. Using multiple hot‐deck imputation, mean imputation and missing‐case deletion as reference methods, we assessed relative error, absolute error, change in inter‐omics correlation structure and variable selection. TOBMIkNN reliably imputed RNA‐seq data by borrowing information from DNA methylation data, and outperformed the other three methods in imputation error and stability of the correlation structure. Our study indicates that TOBMIkNN is a suitable method for trans‐omics block missing data imputation.

Availability and implementation: TOBMIkNN is freely available at https://github.com/XuesiDong/TOBMI.

Supplementary information: Supplementary data are available at Bioinformatics online.
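The general idea described in the abstract, imputing a missing RNA‐seq block from the nearest neighbors found in DNA methylation space, can be sketched as follows. This is a minimal illustration and not the published TOBMIkNN implementation: the function name `knn_weighted_impute`, the use of Euclidean distance, the inverse‐distance weights and the default `k` are all assumptions made for the example, and the abstract does not specify any of them.

```python
import math


def knn_weighted_impute(methylation, expression, k=2):
    """Impute missing expression rows from the k nearest donors.

    methylation: dict sample_id -> list of floats (available for every sample)
    expression:  dict sample_id -> list of floats, or None if the block is missing
    Distance metric and weighting are illustrative assumptions.
    """
    # Donors are the samples that have both omics layers.
    donors = [s for s, row in expression.items() if row is not None]
    if not donors:
        raise ValueError("at least one sample with expression data is required")

    imputed = {}
    for sample, row in expression.items():
        if row is not None:
            imputed[sample] = list(row)  # observed rows are copied unchanged
            continue

        # Rank donors by Euclidean distance in methylation space (assumption).
        ranked = sorted(
            (math.dist(methylation[sample], methylation[d]), d) for d in donors
        )
        nearest = ranked[:k]

        # Inverse-distance weights; the small constant avoids division by zero.
        weights = [1.0 / (dist + 1e-12) for dist, _ in nearest]
        total = sum(weights)

        # Weighted average of the donors' expression values, gene by gene.
        n_genes = len(expression[nearest[0][1]])
        imputed[sample] = [
            sum(w * expression[d][g] for w, (_, d) in zip(weights, nearest)) / total
            for g in range(n_genes)
        ]
    return imputed


# Toy data: sample "t" has methylation values but no expression block.
methylation = {
    "a": [0.0, 0.0],
    "b": [1.0, 0.0],
    "c": [10.0, 0.0],
    "t": [0.5, 0.0],
}
expression = {
    "a": [2.0, 4.0],
    "b": [4.0, 8.0],
    "c": [100.0, 100.0],
    "t": None,
}
result = knn_weighted_impute(methylation, expression, k=2)
```

In this toy example, sample `t` is equally close to `a` and `b` in methylation space, so its imputed expression is the midpoint of their profiles, while the distant sample `c` does not contribute. Weighting by inverse distance is one common choice; other weighting schemes or distance measures would fit the same structure.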
