Application of metabolomics to plant genotype discrimination using statistics and machine learning

MOTIVATION: Metabolomics is a post-genomic technology which seeks to provide a comprehensive profile of all the metabolites present in a biological sample. This complements the mRNA profiles provided by microarrays and the protein profiles provided by proteomics. To test the power of metabolome analysis we selected the problem of discriminating between related genotypes of Arabidopsis. Specifically, the problem tackled was to discriminate between two background genotypes (Col0 and C24) and, more significantly, between the offspring produced by crossbreeding these two lines, the progeny (whose genotypes would differ only in their maternally inherited mitochondria and chloroplasts).

OVERVIEW: A gas chromatography-mass spectrometry (GC-MS) profiling protocol was used to identify 433 metabolites in the samples. The metabolomic profiles were compared using descriptive statistics, which indicated that key primary metabolites vary more than other metabolites. We then applied neural networks to discriminate between the genotypes. This showed clearly that the two background lines can be discriminated from each other and from their progeny, and indicated that the two progeny lines can also be discriminated from each other. We applied Euclidean hierarchical clustering and principal component analysis (PCA) to help understand the basis of genotype discrimination. PCA indicated that malic acid and citrate are the two most important metabolites for discriminating between the background lines, and that glucose and fructose are the two most important metabolites for discriminating between the crosses. These results are consistent with genotype differences in mitochondria and chloroplasts.
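As an illustration of how PCA can point to the variables that separate two groups, the following is a minimal sketch on synthetic data. It is not the authors' pipeline: the function names `pca_loadings` and `top_variables`, the variable labels `m0` to `m4`, and all numbers are hypothetical and chosen only for the example. Two of the five toy variables are deliberately shifted in one group, and the loadings of the first principal component are then ranked by absolute value.

```python
import numpy as np


def pca_loadings(X, n_components=2):
    """PCA via SVD on a (samples x variables) matrix.

    Returns the sample scores, the component loadings (one row per
    component) and the fraction of variance each component explains.
    """
    Xc = X - X.mean(axis=0)                      # centre each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    loadings = Vt[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, loadings, explained[:n_components]


def top_variables(loadings_row, names, k=2):
    """Names of the k variables with the largest absolute loading."""
    order = np.argsort(-np.abs(loadings_row))[:k]
    return [names[i] for i in order]


# Synthetic toy data (an assumption for illustration, not measured profiles):
# 10 samples per group, 5 variables, small Gaussian noise.
rng = np.random.default_rng(0)
names = ["m0", "m1", "m2", "m3", "m4"]
group_a = rng.normal(0.0, 0.1, size=(10, 5))
group_b = rng.normal(0.0, 0.1, size=(10, 5))
group_b[:, 0] += 3.0   # planted difference in m0
group_b[:, 1] -= 2.0   # planted difference in m1

X = np.vstack([group_a, group_b])
scores, loadings, explained = pca_loadings(X)
top = top_variables(loadings[0], names)
print("variables with largest PC1 loadings:", top)
print("variance explained by PC1:", round(float(explained[0]), 3))
```

In this toy setting the first component aligns with the planted group difference, so ranking its loadings recovers the two shifted variables. With real metabolite profiles the variables would normally need scaling first, and a large loading is only a pointer towards a discriminating metabolite, not proof of one.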
