GMSimpute: a generalized two-step Lasso approach to impute missing values in label-free mass spectrum analysis

Abstract

Motivation: Missingness in label-free mass spectrometry is inherent to the technology, so a computational approach to recover missing values in metabolomics and proteomics datasets is important. Most existing methods are designed under one particular assumption: values are either missing at random or missing because they fall below the detection limit. If the actual missing pattern deviates from that assumption, imputation may produce biased results. We therefore investigate the missing patterns in label-free mass spectrometry data and develop an omnibus approach, GMSimpute, that imputes effectively under different missing patterns.

Results: Three proteomics datasets and one metabolomics dataset indicate that missing values can be a mixture of abundance-dependent and abundance-independent missingness. We assess the performance of GMSimpute using simulated data (covering a wide range of 80 missing patterns) and metabolomics data from The Cancer Genome Atlas breast cancer and clear cell renal cell carcinoma studies. Using the Pearson correlation and the normalized root mean square error between the true and imputed abundances, we compare its performance to K-nearest-neighbors-type approaches, Random Forest, GSimp, a model-based method implemented in DanteR, and minimum-value imputation. The results indicate that GMSimpute imputes more accurately and performs stably across different missing patterns. In addition, when applied to The Cancer Genome Atlas datasets, GMSimpute identifies the features in downstream differential expression analysis with high accuracy.

Availability and implementation: GMSimpute is on CRAN: https://cran.r-project.org/web/packages/GMSimpute/index.html.

Supplementary information: Supplementary data are available at Bioinformatics online.
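The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. This is not the GMSimpute package's code: the function names `pearson`, `nrmse`, and `score_imputation` are hypothetical, and the NRMSE normalization used here (root of the mean squared error divided by the variance of the true values) is one common convention that the abstract does not specify. Scores are computed only over entries that were masked as missing.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def nrmse(true, imputed):
    """Assumed convention: sqrt(mean squared error / variance of true values)."""
    n = len(true)
    mse = sum((t - i) ** 2 for t, i in zip(true, imputed)) / n
    mean_true = sum(true) / n
    var_true = sum((t - mean_true) ** 2 for t in true) / n
    return math.sqrt(mse / var_true)

def score_imputation(truth, imputed, mask):
    """Score only the entries where mask is True (values hidden before imputation)."""
    t = [a for a, m in zip(truth, mask) if m]
    i = [b for b, m in zip(imputed, mask) if m]
    return pearson(t, i), nrmse(t, i)
```

A typical use would hide a subset of known abundances, run any imputation method, and pass the original values, the imputed values, and the mask to `score_imputation`; higher correlation and lower NRMSE indicate better recovery.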
