Using Expert Driven Machine Learning to Enhance Dynamic Metabolomics Data Analysis

Data analysis for metabolomics is progressing rapidly thanks to the proliferation of novel tools and the standardization of existing workflows. As untargeted metabolomics datasets and experiments continue to grow in size and complexity, standardized workflows are often not sophisticated enough to handle them. In addition, the ground truth for untargeted metabolomics experiments is intrinsically unknown, which makes the performance of tools difficult to evaluate. Here, the problem of dynamic multi-class metabolomics experiments was investigated using a simulated dataset with a known ground truth. This simulated dataset was used to evaluate the performance of tinderesting, a new and intuitive tool that gathers expert knowledge for use in machine learning. The results were compared to EDGE, a statistical method for time series data. This paper presents three novel outcomes. First, it provides a way to simulate dynamic metabolomics data with a known ground truth based on ordinary differential equations. This method is made available through the MetaboLouise R package. Second, the EDGE tool, originally developed for genomics data analysis, is shown to perform well in analyzing dynamic case vs. control metabolomics data. Third, the tinderesting method is introduced to analyze more complex dynamic metabolomics experiments. This tool consists of a Shiny app for collecting expert knowledge, which in turn is used to train a machine learning model to emulate the decision process of the expert. This approach does not replace traditional data analysis workflows for metabolomics, but it can provide additional information, improved performance, or easier interpretation of results. Its advantage is that the tool is agnostic to the complexity of the experiment and is therefore easier to use in advanced setups. All code for the presented analysis, MetaboLouise, and tinderesting is freely available.
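To make the idea of simulating dynamic data with a known ground truth from ordinary differential equations more concrete, the following is a minimal sketch in Python. It is not the MetaboLouise implementation: the linear four-metabolite chain, the first-order kinetics, the rate values, the Euler integrator, and all names (`simulate`, `control`, `case`, `ground_truth`) are assumptions chosen for illustration. The case condition differs from the control in a single rate constant, so it is known by construction which metabolites should show a different time profile.

```python
import random


def simulate(rates, n_metabolites, t_end=10.0, dt=0.01,
             sample_every=100, noise_sd=0.0, seed=0):
    """Euler-integrate a hypothetical linear chain M0 -> M1 -> ... -> Mn.

    rates[i] is the first-order rate constant of the step M_i -> M_(i+1).
    Returns one list of concentrations per sampled time point.
    """
    rng = random.Random(seed)
    conc = [1.0] + [0.0] * (n_metabolites - 1)  # all mass starts in M0
    steps = int(round(t_end / dt))
    samples = []
    for step in range(steps + 1):
        if step % sample_every == 0:
            # Optional Gaussian measurement noise on the sampled values only
            samples.append([c + rng.gauss(0.0, noise_sd) for c in conc])
        # Flux through each reaction under first-order kinetics
        flux = [rates[i] * conc[i] for i in range(n_metabolites - 1)]
        updated = conc[:]
        for i, f in enumerate(flux):
            updated[i] -= f * dt       # substrate is consumed
            updated[i + 1] += f * dt   # product is formed
        conc = updated
    return samples


N_METABOLITES = 4
CHANGED_STEP = 1  # index of the reaction that differs between conditions

control_rates = [0.5, 0.5, 0.5]
case_rates = [0.5, 0.1, 0.5]  # assumed perturbation: step M1 -> M2 is slowed

control = simulate(control_rates, N_METABOLITES)
case = simulate(case_rates, N_METABOLITES)

# Ground truth by construction: the substrate of the perturbed step and
# everything downstream of it changes; metabolites upstream do not.
ground_truth = [i >= CHANGED_STEP for i in range(N_METABOLITES)]

if __name__ == "__main__":
    for i, differs in enumerate(ground_truth):
        max_diff = max(abs(a[i] - b[i]) for a, b in zip(control, case))
        print(f"M{i}: expected to differ = {differs}, max |diff| = {max_diff:.4f}")
```

Because the labels in `ground_truth` follow from the structure of the simulated network rather than from any statistical test, a ranking produced by a method under evaluation can be scored against them directly, for example with a receiver operating characteristic curve.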
