Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data

Missing values are widespread in mass spectrometry (MS)-based metabolomics data. Various methods have been applied to handle them, but the choice of method can significantly affect subsequent data analyses. Missing values typically fall into three types: missing not at random (MNAR), missing at random (MAR), and missing completely at random (MCAR). Our study comprehensively compared eight imputation methods (zero, half minimum (HM), mean, median, random forest (RF), singular value decomposition (SVD), k-nearest neighbors (kNN), and quantile regression imputation of left-censored data (QRILC)) for different types of missing values using four metabolomics datasets. Normalized root mean squared error (NRMSE) and the NRMSE-based sum of ranks (SOR) were used to evaluate imputation accuracy. Principal component analysis (PCA)/partial least squares (PLS)-Procrustes analysis was used to evaluate the effect on the overall sample distribution. Student’s t-test followed by correlation analysis was conducted to evaluate the effects on univariate statistics. Our findings demonstrated that RF performed best for MCAR/MAR and that QRILC was the favored method for left-censored MNAR. Finally, we proposed a comprehensive strategy and developed a publicly accessible web tool for applying missing value imputation in metabolomics (https://metabolomics.cc.hawaii.edu/software/MetImp/).
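
To make the evaluation idea concrete, the following is a minimal sketch, not the study's implementation. It simulates left-censored MNAR values, applies half-minimum (HM) imputation, and scores the result with an NRMSE. The function names, the simulated data, the 20% censoring cutoff, and the exact NRMSE normalization (mean squared error over the masked entries divided by the variance of the true values at those entries) are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def half_min_impute(X):
    """Replace NaNs in each column (metabolite) with half of that
    column's smallest observed value. Hypothetical helper."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    col_half_min = np.nanmin(X, axis=0) / 2.0
    rows, cols = np.where(np.isnan(X))
    out[rows, cols] = col_half_min[cols]
    return out

def nrmse(X_true, X_imputed, mask):
    """NRMSE over the masked (originally hidden) entries only.
    Assumed normalization: variance of the true masked values."""
    diff = X_true[mask] - X_imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2) / np.var(X_true[mask])))

# Simulated intensities: 50 samples x 10 metabolites (illustrative only).
rng = np.random.default_rng(0)
X_true = rng.lognormal(mean=5.0, sigma=1.0, size=(50, 10))

# Left-censored MNAR: hide values below each metabolite's 20% quantile.
threshold = np.quantile(X_true, 0.2, axis=0)
mask = X_true < threshold
X_missing = np.where(mask, np.nan, X_true)

X_imputed = half_min_impute(X_missing)
score = nrmse(X_true, X_imputed, mask)
print(f"NRMSE for half-minimum imputation: {score:.3f}")
```

The same mask-then-score loop could be repeated for each imputation method and missingness type, with the per-method NRMSE values then ranked and summed to obtain an SOR-style summary.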
