rMisbeta: A robust missing value imputation approach in transcriptomics and metabolomics data

Transcriptomics and metabolomics data often contain missing values or outliers due to limitations of the data acquisition techniques. Most statistical methods require complete datasets for downstream analysis. A number of missing value imputation methods have been developed that rely on the classical mean and variance obtained from maximum likelihood estimators, which are not robust against outliers. Consequently, the performance of these methods deteriorates in the presence of outliers. Accurate imputation of missing values and handling of outliers are therefore both needed at the same time. In this paper, we develop a robust iterative approach, built on robust estimators from the minimum beta divergence method, that simultaneously imputes missing values and handles outliers. We compare the performance of the proposed method with six frequently used missing value imputation methods, namely Zero, KNN, robust SVD, EM, random forest (RF) and the weighted least square approach (WLSA), through feature selection on both simulated and real datasets. Ten performance indices were used to identify the best method: Frobenius norm (FOBN), accuracy (ACC), sensitivity (SN), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), detection rate (DR), misclassification error rate (MER), the area under the ROC curve (AUC) and computational runtime. Evaluation on both simulated and real data suggests that the proposed method outperforms the other traditional methods across various rates of outliers and missing values. In the absence of outliers, the proposed approach performs almost as well as the other methods. The proposed method is accurate, simple, and requires less computational time than the other methods. We therefore recommend the proposed procedure for large-scale transcriptomics and metabolomics data analysis.
The computational tool has been implemented in an R package, which is publicly available from https://CRAN.R-project.org/package=rMisbeta.
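
To make the idea of robust estimation by the minimum beta divergence method concrete, the following is a minimal Python sketch, not the rMisbeta implementation. It assumes a single feature (one row of the data matrix), a Gaussian model, and the commonly used beta-weight form exp(-beta/2 * (x - mu)^2 / var). The function names, the tuning value `beta=0.2`, and the `weight_cutoff` rule for flagging outliers are all hypothetical choices made for illustration.

```python
import math
from statistics import median


def _is_missing(v):
    # Treat None and NaN as missing (an assumption of this sketch).
    return v is None or math.isnan(v)


def beta_mean_var(values, beta=0.2, n_iter=100, tol=1e-10):
    """Iteratively reweighted mean and variance with beta weights."""
    xs = [v for v in values if not _is_missing(v)]
    # Robust starting point: median and scaled median absolute deviation.
    mu = median(xs)
    mad = median([abs(x - mu) for x in xs])
    var = (1.4826 * mad) ** 2 or 1.0
    for _ in range(n_iter):
        # Observations far from the current mean get weights near zero.
        w = [math.exp(-0.5 * beta * (x - mu) ** 2 / var) for x in xs]
        sw = sum(w)
        new_mu = sum(wi * x for wi, x in zip(w, xs)) / sw
        new_var = (1 + beta) * sum(
            wi * (x - new_mu) ** 2 for wi, x in zip(w, xs)
        ) / sw
        new_var = max(new_var, 1e-12)  # guard against a degenerate variance
        converged = abs(new_mu - mu) < tol and abs(new_var - var) < tol
        mu, var = new_mu, new_var
        if converged:
            break
    return mu, var


def impute_row(values, beta=0.2, weight_cutoff=0.2):
    """Replace missing values and low-weight (outlying) values by the robust mean."""
    mu, var = beta_mean_var(values, beta)
    out = []
    for v in values:
        if _is_missing(v):
            out.append(mu)
        elif math.exp(-0.5 * beta * (v - mu) ** 2 / var) < weight_cutoff:
            out.append(mu)
        else:
            out.append(v)
    return out


row = [5.1, 4.9, 5.0, None, 5.2, 4.8, 50.0]
imputed = impute_row(row)
```

In this toy row, the missing entry and the extreme value 50.0 are both replaced by the robust mean, while the remaining observations are left unchanged. A classical mean would instead be pulled strongly toward 50.0, which is the weakness of maximum likelihood based imputation that the abstract describes.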
