Statistical measures for validating plant genotype similarity assessments following multivariate analysis of metabolome fingerprint data

Metabolome fingerprinting offers opportunities for ‘first pass’ evaluation of compositional similarity between plant genotypes. Compositional “substantial equivalence” testing is a popular concept in the food safety literature; however, reported studies do not provide a systematic, standard approach for quantifying similarity in high-dimensional data. We have undertaken a large-scale screen of Arabidopsis genotypes for evidence that individual genetic modifications affect plant phenotype at the level of the metabolome. From this study, we propose pragmatic alternative measures that could in future be used to assess substantial equivalence in GM foods under realistic data paucity constraints and without prior feature selection. Evaluation of classifier accuracy in supervised data mining approaches by bootstrap error estimation provided a robust tool for model validation. Receiver operating characteristic (ROC) analysis, summarized by measures such as the area under the curve (AUC), provides an alternative measure of predictive ability by displaying the relationship between sensitivity and specificity. Additional specific measures based on scatter matrices and sample margins have also been investigated. We illustrate the application of these metrics on a large metabolic profiling data set derived from analysis of 27 genetically distinct Arabidopsis thaliana mutants. We show that model margins, eigenvalues, accuracies and AUC characteristics produced by three different classifiers (Random Forest, Support Vector Machine and Linear Discriminant Analysis) are in agreement. Comparisons of mutants showing no observable phenotypic difference from the parent ecotype provided a baseline for model significance metrics, whilst comparisons of mutants with increasingly distinct phenotypic alterations generated predictable changes in these measures of similarity.
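To make the two headline measures concrete, the following minimal sketch estimates an out-of-bag bootstrap error rate and an AUC for a two-class comparison. It is an illustration under stated assumptions, not the procedure used in the study: the data are synthetic Gaussian "fingerprints", the classifier is a simple nearest-centroid rule standing in for Random Forest, SVM or LDA, the AUC is computed from out-of-bag scores pooled across bootstrap replicates, and all function names (`make_data`, `bootstrap_oob`, and so on) are hypothetical.

```python
import random


def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def score(x, c0, c1):
    """Decision score: positive when x lies closer to the class-1 centroid."""
    d0 = sum((xi - ci) ** 2 for xi, ci in zip(x, c0))
    d1 = sum((xi - ci) ** 2 for xi, ci in zip(x, c1))
    return d0 - d1


def bootstrap_oob(X, y, n_boot=200, seed=0):
    """Return (mean out-of-bag error, AUC over pooled out-of-bag scores)."""
    rng = random.Random(seed)
    n = len(X)
    errors, pos, neg = [], [], []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        drawn = set(idx)
        oob = [i for i in range(n) if i not in drawn]  # samples left out of this replicate
        class0 = [X[i] for i in idx if y[i] == 0]
        class1 = [X[i] for i in idx if y[i] == 1]
        if not class0 or not class1 or not oob:
            continue  # skip degenerate replicates
        c0, c1 = centroid(class0), centroid(class1)
        wrong = 0
        for i in oob:
            s = score(X[i], c0, c1)
            (pos if y[i] == 1 else neg).append(s)
            wrong += int((s > 0) != (y[i] == 1))
        errors.append(wrong / len(oob))
    return sum(errors) / len(errors), auc(pos, neg)


def make_data(shift, n_per_class=15, n_feat=20, seed=1):
    """Synthetic two-class data: class 1 is offset by `shift` in every feature."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            X.append([rng.gauss(label * shift, 1.0) for _ in range(n_feat)])
            y.append(label)
    return X, y


if __name__ == "__main__":
    for shift in (0.0, 3.0):
        err, area = bootstrap_oob(*make_data(shift))
        print(f"shift={shift}: OOB error={err:.3f}, AUC={area:.3f}")
```

In this toy setting, a pair of classes drawn from the same distribution (`shift=0.0`) plays the role of a mutant indistinguishable from its parent ecotype and should give an error rate and AUC near chance, whereas a clearly shifted class should give a low error rate and an AUC close to 1.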
