Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion

Phylogenetic trees have a multitude of applications in biology, epidemiology, conservation and even forensics. However, the inference of phylogenetic trees can be extremely computationally intensive. The computational burden of such analyses becomes even greater when model-based methods are used. Model-based methods have been repeatedly shown to be the most accurate choice for the reconstruction of phylogenetic trees, and thus are an attractive choice despite their high computational demands. Using the Maximum Likelihood (ML) criterion to choose among phylogenetic trees is one commonly used model-based technique. Until recently, software for performing ML analyses of biological sequence data was largely intractable for more vi than about one hundred sequences. Because advances in sequencing technology now make the assembly of datasets consisting of thousands of sequences common, ML search algorithms that are able to quickly and accurately analyze such data must be developed if ML techniques are to remain a viable option in the future. I have developed a fast and accurate algorithm that allows ML phylogenetic searches to be performed on datasets consisting of thousands of sequences. My software uses a genetic algorithm approach, and is named GARLI (Genetic Algorithm for Rapid Likelihood Inference). The speed of this new algorithm results primarily from its novel technique for partial optimization of branch-length parameters following topological rearrangements. Experiments performed with GARLI show that it is able to analyze large datasets in a small fraction of the time required by the previous generation of search algorithms. The program also performs well relative to two other recently introduced fast ML search programs. Large parallel computer clusters have become common at academic institutions in recent years, presenting a new resource to be used for phylogenetic analyses. The P-GARLI algorithm extends the approach of GARLI to allow simultaneous use of many computer processors. The processors may be instructed to work together on a phylogenetic search in either a highly coordinated or largely independent fashion.

[1]  R. Punnett,et al.  The Genetical Theory of Natural Selection , 1930, Nature.

[2]  H. Akaike,et al.  Information Theory and an Extension of the Maximum Likelihood Principle , 1973 .

[3]  John H. Holland,et al.  Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence , 1992 .

[4]  J. Felsenstein Cases in which Parsimony or Compatibility Methods will be Positively Misleading , 1978 .

[5]  R. Gutell,et al.  Secondary structure model for bacterial 16S ribosomal RNA: phylogenetic, enzymatic and chemical evidence. , 1980, Nucleic acids research.

[6]  D. Robinson,et al.  Comparison of phylogenetic trees , 1981 .

[7]  Alistair I. Mees,et al.  Convergence of an annealing algorithm , 1986, Math. Program..

[8]  D. Hull,et al.  Science as a Process: An Evolutionary Account of the Social and Conceptual Development of Science, David L. Hull. 1988. The University of Chicago Press, Chicago, IL. 608 pages. ISBN: 0-226-35060-4. $39.95 , 1989 .

[9]  F. A. Seiler,et al.  Numerical Recipes in C: The Art of Scientific Computing , 1989 .

[10]  D. Maddison The discovery and importance of multiple islands of most , 1991 .

[11]  J. Mullins,et al.  Molecular Epidemiology of HIV Transmission in a Dental Practice , 1992, Science.

[12]  William H. Press,et al.  Numerical recipes in C++: the art of scientific computing, 2nd Edition (C++ ed., print. is corrected to software version 2.10) , 1994 .

[13]  J. Huelsenbeck,et al.  SUCCESS OF PHYLOGENETIC METHODS IN THE FOUR-TAXON CASE , 1993 .

[14]  Z. Yang,et al.  Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. , 1993, Molecular biology and evolution.

[15]  James F. Smith Phylogenetics of seed plants : An analysis of nucleotide sequences from the plastid gene rbcL , 1993 .

[16]  Daniel P. Faith,et al.  Genetic diversity and taxonomic priorities for conservation , 1994 .

[17]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[18]  W. Press,et al.  Numerical Recipes in Fortran: The Art of Scientific Computing.@@@Numerical Recipes in C: The Art of Scientific Computing. , 1994 .

[19]  Hideo Matsuda,et al.  fastDNAmL: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood , 1994, Comput. Appl. Biosci..

[20]  W. Li,et al.  Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites. , 1995, Molecular biology and evolution.

[21]  J. Huelsenbeck Performance of Phylogenetic Methods in Simulation , 1995 .

[22]  K. Holsinger,et al.  Among-site rate variation and phylogenetic analysis of 12S rRNA in sigmodontine rodents. , 1995, Molecular biology and evolution.

[23]  H Matsuda,et al.  Protein phylogenetic inference using maximum likelihood with a genetic algorithm. , 1996, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[24]  D. Hillis Inferring complex phytogenies , 1996, Nature.

[25]  D. Hillis Inferring complex phylogenies. , 1996, Nature.

[26]  O Gascuel,et al.  BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. , 1997, Molecular biology and evolution.

[27]  Andrew Rambaut,et al.  Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees , 1997, Comput. Appl. Biosci..

[28]  P. Lewis,et al.  A genetic algorithm for maximum-likelihood phylogeny inference using nucleotide sequence data. , 1998, Molecular biology and evolution.

[29]  S. Poe,et al.  The Effect of Taxonomic Sampling on Accuracy of Phylogeny Estimation: Test Case of a Known Phylogeny , 1998 .

[30]  Atte Moilanen,et al.  Searching for Most Parsimonious Trees with Simulated Evolutionary Optimization , 1999 .

[31]  W. Fitch,et al.  Predicting the evolution of human influenza A. , 1999, Science.

[32]  M A Newton,et al.  Bayesian Phylogenetic Inference via Markov Chain Monte Carlo Methods , 1999, Biometrics.

[33]  D. Soltis,et al.  Angiosperm phylogeny inferred from multiple genes as a tool for comparative biology , 1999, Nature.

[34]  Hidetoshi Shimodaira,et al.  Multiple Comparisons of Log-Likelihoods with Applications to Phylogenetic Inference , 1999, Molecular Biology and Evolution.

[35]  Riccardo Poli,et al.  Parallel genetic algorithm taxonomy , 1999, 1999 Third International Conference on Knowledge-Based Intelligent Information Engineering Systems. Proceedings (Cat. No.99TH8410).

[36]  A. Rodrigo,et al.  Likelihood-based tests of topologies in phylogenetics. , 2000, Systematic biology.

[37]  D. Pearl,et al.  Stochastic search strategy for estimation of maximum likelihood phylogenetic trees. , 2001, Systematic biology.

[38]  M. Steel,et al.  Subtree Transfer Operations and Their Induced Metrics on Evolutionary Trees , 2001 .

[39]  C. Simon,et al.  Exploring among-site rate variation models in a maximum likelihood framework using empirical data: effects of model assumptions on estimates of topology, branch lengths, and bootstrap support. , 2001, Systematic biology.

[40]  K. Crandall,et al.  Selecting the best-fit model of nucleotide substitution. , 2001, Systematic biology.

[41]  Erick Cantú-Paz,et al.  Efficient and Accurate Parallel Genetic Algorithms , 2000, Genetic Algorithms and Evolutionary Computation.

[42]  Sudhir Kumar,et al.  Incomplete taxon sampling is not a problem for phylogenetic inference , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[43]  J. S. Rogers,et al.  Maximum likelihood estimation of phylogenetic trees is consistent when substitution rates vary according to the invariable sites plus gamma distribution. , 2001, Systematic biology.

[44]  John P. Huelsenbeck,et al.  MRBAYES: Bayesian inference of phylogenetic trees , 2001, Bioinform..

[45]  Clare Bates Congdon Gaphyl: A Genetic Algorithms Approach to Cladistics , 2001, PKDD.

[46]  A. Lemmon,et al.  The metapopulation genetic algorithm: An efficient solution for the problem of large phylogeny estimation , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[47]  Derrick J. Zwickl,et al.  Increased taxon sampling greatly reduces phylogenetic error. , 2002, Systematic biology.

[48]  Matthew J. Brauer,et al.  Genetic algorithms and parallel processing in maximum-likelihood phylogeny inference. , 2002, Molecular biology and evolution.

[49]  David R. Anderson,et al.  Model selection and multimodel inference : a practical information-theoretic approach , 2003 .

[50]  Sudhir Kumar,et al.  Taxon sampling, bioinformatics, and phylogenomics. , 2003, Systematic biology.

[51]  O. Gascuel,et al.  A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. , 2003, Systematic biology.

[52]  A. von Haeseler,et al.  IQPNNI: moving fast through tree space and stopping in time. , 2004, Molecular biology and evolution.

[53]  Michael P. Cummings,et al.  PAUP* [Phylogenetic Analysis Using Parsimony (and Other Methods)] , 2004 .

[54]  David L. Swofford,et al.  Are Guinea Pigs Rodents? The Importance of Adequate Models in Molecular Phylogenetics , 1997, Journal of Mammalian Evolution.

[55]  N. Goldman Simple diagnostic statistical tests of models for DNA substitution , 1993, Journal of Molecular Evolution.

[56]  Ziheng Yang Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods , 1994, Journal of Molecular Evolution.

[57]  K. Schleifer,et al.  ARB: a software environment for sequence data. , 2004, Nucleic acids research.

[58]  Tandy J. Warnow,et al.  Phylogenetic networks: modeling, reconstructibility, and accuracy , 2004, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[59]  Zaid Abdo,et al.  Evaluating the performance of a successive-approximations approach to parameter optimization in maximum-likelihood phylogeny estimation. , 2005, Molecular biology and evolution.

[60]  Y. Tateno,et al.  Robustness of maximum likelihood tree estimation against different patterns of base substitutions , 2005, Journal of Molecular Evolution.

[61]  H. Kishino,et al.  Dating of the human-ape splitting by a molecular clock of mitochondrial DNA , 2005, Journal of Molecular Evolution.

[62]  Arndt von Haeseler,et al.  Simultaneous statistical multiple alignment and phylogeny reconstruction. , 2005, Systematic biology.

[63]  István Miklós,et al.  Bayesian coestimation of phylogeny and sequence alignment , 2005, BMC Bioinformatics.

[64]  S. Carroll,et al.  More genes or more taxa? The relative contribution of gene number and taxon number to phylogenetic accuracy. , 2005, Molecular biology and evolution.

[65]  D. Hillis,et al.  Phylogeny of the New World true frogs (Rana). , 2005, Molecular phylogenetics and evolution.

[66]  M. Suchard,et al.  Joint Bayesian estimation of alignment and phylogeny. , 2005, Systematic biology.

[67]  Alexandros Stamatakis,et al.  An efficient program for phylogenetic inference using simulated annealing , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[68]  Thomas Ludwig,et al.  RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees , 2005, Bioinform..

[69]  J. Felsenstein Evolutionary trees from DNA sequences: A maximum likelihood approach , 2005, Journal of Molecular Evolution.

[70]  G. Serio,et al.  A new method for calculating evolutionary substitution rates , 2005, Journal of Molecular Evolution.

[71]  B. Rannala,et al.  Probability distribution of molecular evolutionary trees: A new method of phylogenetic inference , 1996, Journal of Molecular Evolution.