Predicting interpretability of metabolome models based on behavior, putative identity, and biological relevance of explanatory signals

Powerful algorithms are required to deal with the dimensionality of metabolomics data. Although many achieve high classification accuracy, the models they generate have limited value unless they can be shown to be reproducible and statistically relevant to the biological problem under investigation. Random forest (RF) generates models without any requirement for dimensionality reduction or feature selection, and it ranks individual variables by significance and displays them explicitly. In metabolome fingerprinting by mass spectrometry, each metabolite can be represented by signals at several m/z. Exploiting a prior understanding of the expected biochemical differences between sample classes, we aimed to develop meaningful metrics for the significance of both the overall RF model and individual, potentially explanatory, signals. Pair-wise comparison of related plant genotypes with strong phenotypic differences demonstrated that robust models are not only reproducible but also logically structured: they highlight correlated m/z derived from just a small number of explanatory metabolites that reflect the biological differences between sample classes. RF models were also generated using groupings of samples known to be increasingly phenotypically similar. Although classification accuracy was often reasonable, we showed reproducibly, in both Arabidopsis and potato, a performance threshold based on margin statistics beyond which such models showed little structure indicative of either generalizability or further biological interpretability. In a multiclass problem using 25 Arabidopsis genotypes, despite the complicating effects of ecotype background and secondary metabolome perturbations common to several mutations, the ranking of metabolome signals by RF provided scope for deeper interpretability.
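To illustrate why a margin statistic can separate models that look similar on accuracy alone, the following is a minimal sketch. It assumes the usual random forest definition of a sample margin (the share of tree votes for the true class minus the largest share for any other class); the abstract does not state the exact statistic or threshold used, and the function names, class labels, and vote counts here are hypothetical.

```python
from collections import Counter
from statistics import mean


def sample_margin(votes, true_label):
    """Vote share for the true class minus the largest share for any other class."""
    counts = Counter(votes)
    true_count = counts.get(true_label, 0)
    # Largest vote count among the competing classes (0 if there are none).
    other_count = max(
        (c for label, c in counts.items() if label != true_label), default=0
    )
    return (true_count - other_count) / len(votes)


def mean_margin(all_votes, true_labels):
    """Average margin over all samples; values near 0 indicate a weak model."""
    return mean(sample_margin(v, y) for v, y in zip(all_votes, true_labels))


# Hypothetical per-tree votes from a 10-tree forest for three samples.
votes = [
    ["wt"] * 9 + ["mut"] * 1,   # true class "wt", confident
    ["wt"] * 6 + ["mut"] * 4,   # true class "wt", barely correct
    ["wt"] * 4 + ["mut"] * 6,   # true class "mut", barely correct
]
labels = ["wt", "wt", "mut"]

margins = [sample_margin(v, y) for v, y in zip(votes, labels)]
average = mean_margin(votes, labels)
```

In this toy case every sample is classified correctly by majority vote, yet two of the three margins are small, so the average margin is well below what the accuracy alone would suggest.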
