Efficient alignment-free DNA barcode analytics

Background: In this work we consider DNA barcode analysis problems and address them with alignment-free methods and representations that model sequences as collections of short sequence fragments (features). The methods use fixed-length representations (spectra) of barcode sequences to measure similarity or dissimilarity between sequences from the same or different species. The spectrum-based representation not only allows accurate and computationally efficient species classification, but also opens the possibility of accurate clustering analysis of putative species barcodes and of identifying critical within-barcode loci that distinguish the barcodes of different sample groups.

Results: The new alignment-free methods provide highly accurate and fast DNA barcode-based identification and classification of species, with substantial improvements in accuracy and speed over state-of-the-art barcode analysis methods. We evaluate our methods on species classification and identification using barcodes, analytical tasks that are important in many practical applications (monitoring adverse species movement, sampling surveys for identifying unknown or pathogenic species, biodiversity assessment, etc.). On several benchmark barcode datasets, including ACG, Astraptes, Hesperiidae, Fish larvae, and Birds of North America, the proposed alignment-free methods considerably improve prediction accuracy compared to prior results. We also observe significant running time improvements over the state-of-the-art methods.

Conclusion: Our results show that the newly developed alignment-free methods for DNA barcoding can identify specimens efficiently and with high accuracy by examining only a few barcode features, which increases the scalability and interpretability of current computational approaches to barcoding.
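
As an illustration of the spectrum representation described in the Background, the following minimal Python sketch counts the short fragments (k-mers) of a sequence and compares two sequences without aligning them. It is not the authors' implementation: the function names, the choice of k = 3, and the cosine-style normalization are assumptions made for this example only.

```python
from collections import Counter
from math import sqrt


def spectrum(seq, k=3):
    """Return the k-spectrum of seq: counts of every contiguous k-mer.

    k=3 is an arbitrary choice for illustration, not a value from the paper.
    """
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the two k-spectra (shared k-mers, weighted by counts)."""
    sa, sb = spectrum(a, k), spectrum(b, k)
    return sum(count * sb[kmer] for kmer, count in sa.items())


def similarity(a, b, k=3):
    """Kernel value normalized to [0, 1]; 0.0 if either spectrum is empty."""
    denom = sqrt(spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k))
    return spectrum_kernel(a, b, k) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy sequences, not real barcodes.
    print(spectrum("ACGAC", k=2))
    print(similarity("ACGAC", "ACGT", k=2))
```

Because each sequence maps to a fixed set of k-mer counts regardless of its length, two barcodes can be compared in time linear in their lengths, and the resulting similarity could be passed to a standard classifier or clustering method.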
