Influence of Missing Values Substitutes on Multivariate Analysis of Metabolomics Data

Missing values are known to be problematic for the analysis of gas chromatography-mass spectrometry (GC-MS) metabolomics data. Typically these values account for about 10%–20% of all data and can arise from various sources, including analytical, computational, and biological ones. Currently, the best-known substitute for missing values is mean imputation. In fact, some researchers consider this step of their metabolomics pipeline so routine that they do not even mention using this replacement approach. However, the choice of substitute may have a significant influence on the output of the data analysis and might be highly sensitive to the distribution of samples between different classes. Therefore, in this study we analysed different substitutes for missing values, namely zero, mean, median, k-nearest neighbours (kNN) and random forest (RF) imputation, in terms of their influence on unsupervised and supervised learning and, thus, their impact on the final biological interpretation. These comparisons are demonstrated both visually and computationally (classification rate) to support our findings. The results show that the choice of replacement method may have a considerable effect on classification accuracy; if performed incorrectly, imputation may negatively influence the biomarkers selected for early disease diagnosis or the identification of cancer-related metabolites. For the GC-MS metabolomics data studied here, our findings recommend that RF be favoured over the other tested methods for imputing missing values. This approach displayed excellent classification rates for both supervised methods, namely principal components-linear discriminant analysis (PC-LDA) (98.02%) and partial least squares-discriminant analysis (PLS-DA) (97.96%), outperforming the other imputation methods.
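
To make the simplest of the compared substitutes concrete, the following minimal Python sketch applies zero, mean and median imputation to a single metabolite column. It is an illustration only, not the pipeline used in the study: the function name `impute_column`, the use of `None` to mark a missing value, and the peak intensities are all hypothetical. kNN and RF imputation are omitted because they rely on information from other variables and samples rather than on one column alone.

```python
from statistics import mean, median


def impute_column(values, method):
    """Replace None entries in one metabolite column with a single substitute.

    `method` is one of "zero", "mean" or "median"; the mean and median are
    computed from the observed (non-missing) values of that column only.
    """
    observed = [v for v in values if v is not None]
    if method == "zero":
        fill = 0.0
    elif method == "mean":
        fill = mean(observed)
    elif method == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]


# Hypothetical peak intensities for one metabolite across six samples;
# None marks a missing value, and 50.0 is a deliberately extreme sample.
peak = [10.0, 12.0, None, 11.0, 50.0, None]

imputed = {m: impute_column(peak, m) for m in ("zero", "mean", "median")}
for method, column in imputed.items():
    print(method, column)
```

In this toy column the single extreme sample pulls the mean substitute well above the median substitute, which hints at why the choice of replacement can alter downstream multivariate models.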
