Species Identification Using Partial DNA Sequence: A Machine Learning Approach

Species identification from partial DNA sequences has proved effective across different organisms. A DNA barcode is a short genetic marker in an organism's DNA that is used to identify the species to which it belongs. In this work, we analyze the effectiveness of supervised machine learning methods for classifying species from DNA barcodes. We choose specimens from phylogenetically diverse species belonging to the animal, plant, and fungus kingdoms. We consider the following supervised machine learning methods: simple logistic function, random forest, PART, instance-based k-nearest neighbor, attribute-based classifier, and bagging. Results on various datasets show that the classification performance of the selected methods is encouraging, with an average accuracy of 93.66%. This is a relative improvement of about 6% (5.29 percentage points) over state-of-the-art DNA barcode classification methods, which have an average accuracy of 88.37%.
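As an illustration of one of the listed methods, the following is a minimal sketch of instance-based k-nearest neighbor classification on aligned barcode fragments. It is not the implementation evaluated in this work: the Hamming distance, the function names (hamming, knn_predict, accuracy), the value k=3, and the toy sequences and species labels are all assumptions made for the example.

```python
from collections import Counter


def hamming(a, b):
    """Count mismatching positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def knn_predict(query, reference, k=3):
    """Return the majority species label among the k closest reference barcodes."""
    ranked = sorted(reference, key=lambda item: hamming(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def accuracy(queries, reference, k=3):
    """Fraction of labeled query sequences assigned to their true species."""
    correct = sum(knn_predict(seq, reference, k) == label for seq, label in queries)
    return correct / len(queries)


# Toy, made-up aligned fragments (hypothetical, not real barcode data).
REFERENCE = [
    ("ACGTACGTAC", "species_A"),
    ("ACGTACGTAA", "species_A"),
    ("ACGTACGAAC", "species_A"),
    ("TTGTACCTAG", "species_B"),
    ("TTGTACCTAC", "species_B"),
    ("TTGAACCTAG", "species_B"),
]

if __name__ == "__main__":
    # Hypothetical held-out specimens with known labels.
    queries = [("ACGTACGTAT", "species_A"), ("TTGTACCTAA", "species_B")]
    print(accuracy(queries, REFERENCE))
```

Accuracy here is the fraction of held-out specimens assigned to their true species, the same kind of measure as the averages reported above. Real barcode data would first need alignment, or an alignment-free feature representation, before a position-wise distance such as this one applies.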
