Probabilistic Arithmetic Automata and Their Applications

We present a comprehensive review of probabilistic arithmetic automata (PAAs), a general model for chains of operations whose operands depend on chance, along with two algorithms to numerically compute the distribution of the results of such probabilistic calculations. PAAs provide a unifying framework for many problems arising in computational biology and elsewhere. We present five applications: 1) pattern matching statistics on random texts, including the computation of the distribution of occurrence counts, waiting times, and clump sizes under hidden Markov background models; 2) exact analysis of window-based pattern matching algorithms; 3) sensitivity of filtration seeds used to detect candidate sequence alignments; 4) length and mass statistics of peptide fragments resulting from enzymatic cleavage reactions; and 5) read length statistics of 454 and IonTorrent sequencing reads. The diversity of these applications demonstrates the flexibility and unifying character of the framework. While the construction of a PAA depends on the particular application, we single out a frequently applicable construction method: we introduce deterministic arithmetic automata (DAAs) to model deterministic calculations on sequences, and demonstrate how to construct a PAA from a given DAA and a finite-memory random text model. This procedure is used for all five applications discussed and greatly simplifies the construction of PAAs. Implementations are available as part of the MoSDi package, whose application programming interface facilitates the rapid development of new applications based on the PAA framework.
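To make the DAA-to-PAA construction concrete, the following minimal sketch computes the exact distribution of the occurrence count of a single pattern in a random text. It rests on several assumptions that are not taken from the paper: the text model is i.i.d. (the simplest finite-memory model), the deterministic automaton is a naively built KMP-style DFA, the arithmetic operation is addition of a 0/1 emission, and the function names `build_dfa` and `occurrence_count_distribution` are hypothetical. It does not use or reproduce the MoSDi API.

```python
from collections import defaultdict


def build_dfa(pattern, alphabet):
    """Naive KMP-style DFA (illustrative helper, not from the paper).

    State q means: the longest pattern prefix that is a suffix of the
    text read so far has length q. State len(pattern) marks a match.
    """
    m = len(pattern)
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        for c in alphabet:
            s = pattern[:q] + c
            k = min(m, len(s))
            # Shrink k until pattern[:k] is a suffix of s (k = 0 always fits).
            while k > 0 and pattern[:k] != s[-k:]:
                k -= 1
            delta[q][c] = k
    return delta


def occurrence_count_distribution(pattern, char_probs, n):
    """Distribution of the number of (overlapping) pattern occurrences
    in an i.i.d. random text of length n.

    The joint distribution over (automaton state, accumulated value) is
    propagated one character at a time. The value is updated by adding
    an emission of 1 whenever the accepting state is entered.
    """
    alphabet = sorted(char_probs)
    delta = build_dfa(pattern, alphabet)
    m = len(pattern)
    joint = {(0, 0): 1.0}  # start state 0, start value 0
    for _ in range(n):
        nxt = defaultdict(float)
        for (q, v), p in joint.items():
            for c in alphabet:
                q2 = delta[q][c]
                v2 = v + (1 if q2 == m else 0)
                nxt[(q2, v2)] += p * char_probs[c]
        joint = nxt
    # Marginalize out the automaton state to obtain the value distribution.
    dist = defaultdict(float)
    for (_, v), p in joint.items():
        dist[v] += p
    return dict(dist)


if __name__ == "__main__":
    print(occurrence_count_distribution("ab", {"a": 0.5, "b": 0.5}, 2))
```

For the pattern `ab` over a uniform two-letter alphabet and text length 2, only the text `ab` contains an occurrence, so the count is 1 with probability 0.25 and 0 otherwise. A text model with memory would extend the state to pairs of automaton state and text-model context; that extension is omitted here.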
