Motif statistics

We present a complete analysis of the statistics of the number of occurrences of a regular expression pattern in a random text. This covers the "motifs" widely used in computational biology. Our approach is based on: (i) classical constructive results in automata and formal language theory; (ii) analytic combinatorics, used to derive asymptotic properties from generating functions; (iii) computer algebra, used to determine generating functions explicitly, analyse them, and extract their coefficients efficiently. We provide constructions for both overlapping and non-overlapping matches of a regular expression. A companion implementation produces: multivariate generating functions for the statistics under study; a fast computation of their Taylor coefficients, which yields exact values of the moments, with typical application to random texts of size 30,000; and precise asymptotic formulae that allow predictions in texts of arbitrarily large size. Our implementation was tested by comparing predictions of the number of occurrences of motifs against the 7-megabyte amino acid database PRODOM. Our programs handled more than 88% of the standard collection of PROSITE motifs. Such comparisons help detect which motifs are observed in real biological data more or less frequently than theoretically predicted.
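The automaton-based counting idea can be illustrated with a minimal sketch. It rests on strong simplifying assumptions that the abstract does not make: the pattern is a single word rather than a general regular expression, the text is uniform random over the alphabet, and the distribution is obtained by direct dynamic programming over (automaton state, match count) pairs rather than through generating functions and computer algebra. The names `build_dfa`, `occurrence_distribution` and `mean_and_variance` are hypothetical and are not taken from the companion implementation.

```python
from fractions import Fraction


def build_dfa(word, alphabet):
    """Deterministic automaton recognising all texts that end with `word`.

    State s means: the longest prefix of `word` that is a suffix of the
    text read so far has length s. State len(word) marks a match.
    """
    m = len(word)
    delta = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for ch in alphabet:
            s = word[:state] + ch
            k = min(m, len(s))
            # Shrink k until word[:k] is a suffix of s (k = 0 always works).
            while k > 0 and s[-k:] != word[:k]:
                k -= 1
            delta[state][ch] = k
    return delta


def occurrence_distribution(word, alphabet, n):
    """Exact law of the number of (overlapping) occurrences of `word`
    in a uniform random text of length n (assumed model)."""
    m = len(word)
    delta = build_dfa(word, alphabet)
    p = Fraction(1, len(alphabet))
    dist = {(0, 0): Fraction(1)}  # (state, matches so far) -> probability
    for _ in range(n):
        nxt = {}
        for (state, k), pr in dist.items():
            for ch in alphabet:
                t = delta[state][ch]
                key = (t, k + (1 if t == m else 0))
                nxt[key] = nxt.get(key, Fraction(0)) + pr * p
        dist = nxt
    counts = {}
    for (_, k), pr in dist.items():
        counts[k] = counts.get(k, Fraction(0)) + pr
    return counts


def mean_and_variance(counts):
    """Exact first two moments of a distribution given as {value: prob}."""
    mean = sum(k * pr for k, pr in counts.items())
    second = sum(k * k * pr for k, pr in counts.items())
    return mean, second - mean * mean


if __name__ == "__main__":
    law = occurrence_distribution("aa", "ab", 4)
    print(mean_and_variance(law))
```

Because the arithmetic uses `Fraction`, the moments are exact, in the spirit of the exact Taylor-coefficient computation described above. The cost of this naive sketch grows with the number of (state, count) pairs, so it is only suited to short texts; it is not a substitute for the generating-function machinery.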
