A Note on Probabilistic Models over Strings: The Linear Algebra Approach

Probabilistic models over strings have played a key role in developing methods that treat indels as phylogenetically informative events. There is an extensive literature on using automata and transducers on phylogenies to perform inference in these models. An important theoretical question in this literature is the complexity of computing the normalization of a class of string-valued graphical models. This question has been investigated using tools from combinatorics, dynamic programming, and graph theory, and it has practical applications in Bayesian phylogenetics. In this work, we revisit the question from a different point of view, based on linear algebra. The main contribution is a set of results, derived from this linear algebra view, that facilitate the analysis and design of inference algorithms on string-valued graphical models. As an illustration, we use this method to give a new elementary proof of a known result on the complexity of inference on the “TKF91” model, a well-known probabilistic model over strings. Compared to previous work, our proof technique is easier to extend to other models, since it relies on a novel weak condition, triangular transducers, which is easy to establish in practice. The linear algebra view provides a concise way of describing transducer algorithms and their compositions, and it opens the possibility of transferring fast linear algebra libraries (for example, GPU-based ones), as well as low-rank matrix approximation methods, to string-valued inference problems.
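To make the linear algebra view concrete, the following minimal sketch computes the normalization of a small weighted automaton by solving a linear system instead of enumerating strings. It is not the construction from this work. It only illustrates the standard identity that, when the spectral radius of M is below 1, the total weight of all strings is init^T (I - M)^{-1} final, where M is the sum of the per-symbol transition matrices. The two-state automaton, its weights, and all function names are hypothetical and chosen for illustration.

```python
from itertools import product


def mat_vec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def normalizer(init, trans, final):
    """Z = init^T (I - M)^{-1} final, with M the sum of per-symbol matrices.

    Assumes the spectral radius of M is below 1 so the series converges.
    """
    n = len(init)
    m = [[sum(t[i][j] for t in trans.values()) for j in range(n)]
         for i in range(n)]
    i_minus_m = [[(1.0 if i == j else 0.0) - m[i][j] for j in range(n)]
                 for i in range(n)]
    x = solve(i_minus_m, final)
    return sum(p * q for p, q in zip(init, x))


def brute_force(init, trans, final, max_len):
    """Sum the weights of all strings of length <= max_len (truncated check)."""
    total = 0.0
    for length in range(max_len + 1):
        for word in product(trans, repeat=length):
            v = final
            for sym in reversed(word):
                v = mat_vec(trans[sym], v)
            total += sum(p * q for p, q in zip(init, v))
    return total


# Hypothetical two-state automaton over the alphabet {a, b}.
# Each row's outgoing weights plus its stopping weight sum to 1.
INIT = [1.0, 0.0]
TRANS = {
    "a": [[0.2, 0.1], [0.0, 0.3]],
    "b": [[0.1, 0.2], [0.1, 0.1]],
}
FINAL = [0.4, 0.5]

if __name__ == "__main__":
    print("closed form:", normalizer(INIT, TRANS, FINAL))
    print("truncated sum:", brute_force(INIT, TRANS, FINAL, 12))
```

The closed form costs one linear solve in the number of states, whereas the brute-force sum grows exponentially with the string length and only approaches the same value from below. The dense solve here could in principle be replaced by an optimized or GPU-backed routine.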
