Probabilistic models of evolution and language change

Both linguistics and biology face scientific questions that require reconstructing phylogenies and ancestral sequences from a collection of modern descendants. In linguistics, these ancestral sequences are the words that appeared in the protolanguages from which modern languages evolved. Linguists painstakingly reconstruct these words by hand using knowledge of the relationships between languages and the plausibility of sound changes. In biology, analogous questions concern the DNA, RNA, or protein sequences of ancestral genes and genomes. By reconstructing ancestral sequences and the evolutionary paths between them, biologists can make inferences about the evolution of gene function and the nature of the environment in which these genes evolved. In this work, we describe several probabilistic models designed to attack the main phylogenetic problems: tree inference, ancestral sequence reconstruction, and multiple sequence alignment. For each model, we discuss the issues of representation, inference, analysis, and empirical evaluation. Among the contributions, we propose the first computational approach to diachronic phonology that scales to large phylogenies. Sound changes and markedness are taken into account using a flexible feature-based unsupervised learning framework. Using this model, we attack a 50-year-old open problem in linguistics regarding the role of functional load in language change. We also introduce three novel algorithms for inferring multiple sequence alignments, and a stochastic process allowing joint, accurate, and efficient inference of phylogenetic trees and multiple sequence alignments. Finally, many of the tools developed for inference over these models are applicable more broadly, creating a transfer of ideas from phylogenetics into machine learning as well. In particular, the variational framework used for multiple sequence alignment extends to a broad class of combinatorial inference problems.
