Evolutionary model selection with a genetic algorithm: a case study using stem RNA.

The choice of a probabilistic model to describe sequence evolution can and should be justified. Underfitting the data with overly simplistic models may miss interesting phenomena and lead to incorrect inferences. Overfitting the data with models that are too complex may ascribe biological meaning to statistical artifacts and result in falsely significant findings. We describe a likelihood-based approach for evolutionary model selection. The procedure employs a genetic algorithm (GA) to quickly explore the combinatorially large set of all possible time-reversible Markov models with a fixed number of substitution rates. When applied to stem RNA data subject to well-understood evolutionary forces, the models found by the GA 1) capture the expected overall rate patterns a priori; 2) fit the data better than the best available models based on a priori assumptions, suggesting subtle substitution patterns not previously recognized; 3) cannot be rejected in favor of the general reversible model, implying that the evolution of stem RNA sequences can be explained well with only a few substitution rate parameters; and 4) perform well on simulated data, both in terms of goodness of fit and the ability to estimate evolutionary rates. We also investigate the utility of several distance measures for comparing and contrasting inferred evolutionary models. Using widely available small computer clusters, our approach makes it possible, for the first time, to evaluate the performance of existing RNA evolutionary models by comparing them with a large pool of candidate models and to validate common modeling assumptions. In addition, the new method provides the foundation for rigorous selection and comparison of substitution models for other types of sequence data.
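To give a sense of why the model space is described as combinatorially large, here is a minimal sketch under a stated assumption: a candidate model is taken to be an assignment of each exchangeable substitution type to one of a fixed number of shared rate classes, so the number of distinct models equals a Stirling number of the second kind. The function names `stirling2` and `count_models` are hypothetical and not from the paper, and the six-rate nucleotide case is used only as a small, familiar illustration; the stem RNA setting involves more substitution types and a correspondingly larger space.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Number of ways to partition n labeled items into k non-empty, unlabeled groups."""
    if n == k:
        return 1  # includes the empty partition S(0, 0) = 1
    if k == 0 or k > n:
        return 0
    # Item n either joins one of the k existing groups or starts a new group.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def count_models(n_substitution_types: int, n_rate_classes: int) -> int:
    """Assumed model count: substitution types grouped into exactly n_rate_classes classes."""
    return stirling2(n_substitution_types, n_rate_classes)


if __name__ == "__main__":
    # Illustration only: 6 exchangeabilities of a reversible nucleotide model.
    for k in range(1, 7):
        print(k, count_models(6, k))
    print("total:", sum(count_models(6, k) for k in range(1, 7)))
```

The count grows very quickly with the number of substitution types, which is the motivation for a stochastic search such as a GA instead of exhaustive enumeration.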
