A comparative evaluation of the generalised predictive ability of eight machine learning algorithms across ten clinical metabolomics data sets for binary classification

Metabolomics is increasingly being used in the clinical setting for disease diagnosis, prognosis and risk prediction. Machine learning algorithms are particularly important in the construction of multivariate metabolite prediction models. Historically, partial least squares (PLS) regression has been the gold standard for binary classification. Nonlinear machine learning methods such as random forests (RF), kernel support vector machines (SVM) and artificial neural networks (ANN) may be better suited to modelling possible nonlinear metabolite covariance, and may thus provide better predictive models. We hypothesise that, for binary classification using metabolomics data, nonlinear machine learning methods will provide superior generalised predictive ability when compared to linear alternatives, in particular the current gold standard, PLS discriminant analysis. We compared the generalised predictive performance of eight archetypal machine learning algorithms across ten publicly available clinical metabolomics data sets. The algorithms were implemented in the Python programming language. All code and results have been made publicly available as Jupyter notebooks. Across all data sets, SVM and ANN showed only marginal improvement in predictive ability over PLS. RF performance was comparatively poor. Out-of-bag bootstrap confidence intervals provided a measure of uncertainty in model prediction, which showed that the quality of the metabolomics data had a greater influence on generalised performance than the choice of model. The size of the data set and the choice of performance metric also had a greater influence on generalised predictive performance than the choice of machine learning algorithm.
