A step toward barcoding life: a model-based, decision-theoretic method to assign genes to preexisting species groups.

A major part of the barcoding-of-life problem is assigning newly sequenced or sampled individuals to existing groups that have been identified externally (by a taxonomist, for example). This requires evaluating the statistical evidence for associating a sequence from a new individual with one group or another. The main concern of our current research is to perform this task quickly and accurately. To accomplish this, we have developed a model-based, decision-theoretic framework based on coalescent theory. Under this framework, we assign a newly sampled individual using both a distance measure and the posterior probability of a group, given the sequences from the members of that group and the sequence from the new individual. We believe that this approach makes efficient use of the information available in the data. Our preliminary results indicate that this approach is more accurate than assignment based on a simple measure of distance alone.
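The decision-theoretic step can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that some model (for example a coalescent-based one) has already produced a log-likelihood of the new sequence under each candidate group, and it applies a generic Bayes rule with 0-1 loss and a reject option. The function names, the uniform default prior, and the `reject_cost` threshold are all hypothetical, and the distance component mentioned above is left out.

```python
import math


def posterior_probabilities(log_likelihoods, priors=None):
    """Turn per-group log-likelihoods into posterior probabilities.

    log_likelihoods: dict mapping group name -> log P(data | group).
    priors: optional dict of prior probabilities (assumed uniform if omitted).
    """
    groups = list(log_likelihoods)
    if priors is None:
        priors = {g: 1.0 / len(groups) for g in groups}
    log_joint = {g: log_likelihoods[g] + math.log(priors[g]) for g in groups}
    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(log_joint.values())
    weights = {g: math.exp(v - peak) for g, v in log_joint.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def assign(log_likelihoods, priors=None, reject_cost=0.2):
    """Pick the group with the smallest expected 0-1 loss.

    Returns (group, posteriors). The group is None when the expected loss
    of the best choice exceeds reject_cost (a hypothetical threshold),
    meaning the individual is left unassigned.
    """
    post = posterior_probabilities(log_likelihoods, priors)
    best = max(post, key=post.get)
    expected_loss = 1.0 - post[best]  # probability that the assignment is wrong
    if expected_loss > reject_cost:
        return None, post
    return best, post


if __name__ == "__main__":
    # Placeholder log-likelihoods; real values would come from the model.
    label, post = assign({"species_A": -10.0, "species_B": -12.0})
    print(label, post)
```

Under 0-1 loss the rule reduces to choosing the group with the highest posterior probability; the reject option only matters when no group is clearly favored.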
