Supervised DNA Barcodes species classification: analysis, comparisons and results

Background: Specific fragments from short portions of DNA (e.g., mitochondrial, nuclear, and plastid sequences) have been defined as DNA Barcodes and can be used as markers for organisms of the main life kingdoms. Species classification with DNA Barcode sequences has proven effective on different organisms. Indeed, specific gene regions have been identified as Barcodes: COI in animals, rbcL and matK in plants, and ITS in fungi. The classification problem is to assign an unknown specimen to a known species by analyzing its Barcode. This task has to be supported by reliable methods and algorithms.

Methods: This work shows the efficacy of supervised machine learning methods in classifying species from DNA Barcode sequences. The Weka software suite, which includes a collection of supervised classification methods, is adopted for DNA Barcode analysis. Classifier families are tested on synthetic and empirical datasets belonging to the animal, fungus, and plant kingdoms. In particular, the function-based Support Vector Machines (SVM), the rule-based RIPPER, the decision tree C4.5, and the Naïve Bayes method are considered. Additionally, the classification results are compared with those of ad-hoc and well-established DNA Barcode classification methods.

Results: Software that converts DNA Barcode FASTA sequences to the Weka format is released, to accommodate different input formats and to allow the classification procedure to be run. The analysis of results on synthetic and real datasets shows that SVM and Naïve Bayes outperform, on average, the other classifiers considered, although they do not provide a human-interpretable classification model. Rule-based methods have slightly lower classification performance, but deliver the species-specific positions and nucleotide assignments. On synthetic data, the supervised machine learning methods achieve better classification performance than the traditional DNA Barcode classification methods. On empirical data, their classification performance is comparable to that of the other methods.

Conclusions: The classification analysis shows that supervised machine learning methods are promising candidates for successfully handling the DNA Barcoding species classification problem, obtaining excellent performance. To conclude, a powerful tool for species identification is now available to the DNA Barcoding community.
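The FASTA-to-Weka conversion mentioned in the Results can be illustrated with a minimal sketch. This is not the released converter: it assumes that the sequences are already aligned to equal length, that each FASTA header ends with the species name after a `|` separator, and that any symbol other than A, C, G, or T should be written as the ARFF missing-value marker `?`. The function names `parse_fasta` and `fasta_to_arff` and the attribute names `pos1`, `pos2`, ... are hypothetical.

```python
import re
from typing import List, Tuple

# Assumed nominal values for each sequence position.
NUCLEOTIDES = ["A", "C", "G", "T"]


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fasta_to_arff(text: str, relation: str = "barcodes") -> str:
    """Build ARFF text with one nominal attribute per aligned position."""
    records = parse_fasta(text)
    if not records:
        raise ValueError("no FASTA records found")
    length = len(records[0][1])
    species_seen: List[str] = []
    rows: List[str] = []
    for header, seq in records:
        if len(seq) != length:
            raise ValueError("sequences must be aligned to equal length")
        # Assumption: the species name is the last '|'-separated header field.
        species = re.sub(r"[^A-Za-z0-9_]", "_", header.split("|")[-1].strip())
        if species not in species_seen:
            species_seen.append(species)
        # Gaps and ambiguity codes become '?', the ARFF missing-value marker.
        symbols = [c if c in NUCLEOTIDES else "?" for c in seq]
        rows.append(",".join(symbols + [species]))
    values = ",".join(NUCLEOTIDES)
    lines = [f"@relation {relation}", ""]
    for i in range(1, length + 1):
        lines.append(f"@attribute pos{i} {{{values}}}")
    lines.append(f"@attribute species {{{','.join(species_seen)}}}")
    lines += ["", "@data"] + rows
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = ">s1|Homo sapiens\nACGT\n>s2|Pan troglodytes\nACGA\n"
    print(fasta_to_arff(demo))
```

Treating every aligned position as a nominal attribute is one way to let position-based classifiers such as rule learners and decision trees report which positions and nucleotides separate the species. Other encodings are possible.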
