ModelTeller: model selection for optimal phylogenetic reconstruction using machine learning.

Statistical criteria have long been the standard for selecting the best model for phylogenetic reconstruction and downstream statistical inference. Although model selection is regarded as a fundamental step in phylogenetics, existing methods for this task are computationally expensive and slow, are not always feasible, and sometimes depend on preliminary assumptions that do not hold for sequence data. Moreover, while these methods are dedicated to revealing the processes that underlie the sequence data, they do not always produce the most accurate trees. Notably, phylogeny reconstruction consists of two related tasks: topology reconstruction and branch-length estimation. It was previously shown that in many cases the most complex model, GTR+I+G, yields topologies that are as accurate as those obtained with existing model selection criteria, but overestimates branch lengths. Here, we present ModelTeller, a computational methodology for phylogenetic model selection, devised within the machine-learning framework and optimized to predict the most accurate nucleotide substitution model for branch-length estimation. We demonstrate that ModelTeller leads to more accurate branch-length inference than current model selection criteria on datasets simulated under realistic processes. ModelTeller relies on a readily implemented machine-learning model; because its prediction is based on features extracted from the sequence data, running time decreases substantially compared to existing strategies. By harnessing the machine-learning framework, we distinguish between the features that contribute most to branch-length optimization, namely those that describe sequence divergence, and the features that current criteria rely on, namely estimates of the model parameters.
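As a rough illustration of what "features that describe the sequence divergence" could look like in practice, the following minimal sketch computes a few simple divergence summaries from a toy nucleotide alignment. The feature names, the `divergence_features` helper, and the toy data are assumptions made for illustration only; they are not the feature set or code used by ModelTeller.

```python
from itertools import combinations

NUCLEOTIDES = "ACGT"


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or an ambiguous character are skipped.
    """
    compared = 0
    differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in NUCLEOTIDES and y in NUCLEOTIDES:
            compared += 1
            differing += x != y
    return differing / compared if compared else 0.0


def divergence_features(alignment):
    """Hypothetical divergence features for a dict of name -> aligned sequence."""
    seqs = list(alignment.values())
    n_sites = len(seqs[0])
    dists = [p_distance(a, b) for a, b in combinations(seqs, 2)]
    # A site is "variable" if more than one unambiguous nucleotide is observed.
    variable = sum(
        len({s[i].upper() for s in seqs} & set(NUCLEOTIDES)) > 1
        for i in range(n_sites)
    )
    return {
        "n_taxa": len(seqs),
        "n_sites": n_sites,
        "mean_p_distance": sum(dists) / len(dists),
        "max_p_distance": max(dists),
        "frac_variable_sites": variable / n_sites,
    }


# Toy alignment (illustrative data, not from the paper).
toy = {"t1": "ACGTACGT", "t2": "ACGTACGA", "t3": "ACGAACGA"}
features = divergence_features(toy)
print(features)
```

A feature vector of this kind could then be passed to any supervised learner trained to predict which substitution model gives the most accurate branch lengths. Unlike likelihood-based criteria, computing such summaries does not require optimizing each candidate model's parameters on the data.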
