We present algorithms for the exact computation of the probability that a random string of a given length matches a given regular expression. These algorithms can be used to determine statistical significance in a variety of pattern searches, such as motif searches and gene-finding. This work improves upon the work of Kleffe and Langbecker (Kleffe & Langbecker 1990) and of Sewell and Durbin (Sewell & Durbin 1995) in several ways. First, in many cases of interest, the algorithms presented here are faster. Second, the class of patterns considered here strictly includes those of both previous works and also allows, for instance, arbitrary-length gaps. Third, the probability model that can be used is more general than that of Sewell and Durbin, allowing for Markov chains. The problem solved in this work is in fact NP-hard, and NP-hard problems are believed to be intractable. However, the problem is fixed-parameter tractable, meaning that it is tractable for small patterns. The problem is also computationally feasible for many patterns that occur in practice. As a sample application, we consider calculating the statistical significance of most of the PROSITE patterns, as in Sewell and Durbin. Whereas their method was only fast enough to compute the exact probabilities for sequences whose length exceeds the pattern length by 13, we calculate these probabilities for sequences of length up to 2000. In addition, we calculate most of these probabilities using a first-order Markov chain. Most of the PROSITE patterns have high significance at length 2000 under both the i.i.d. and Markov chain models. As further applications, we demonstrate the calculation of the probability of a PROSITE pattern occurring on either strand of a random DNA sequence of up to 500 kilobases, and the probability of a simple gene model occurring in a random sequence of up to 1 megabase.
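To make the computed quantity concrete, the following is a minimal sketch of one standard way to obtain such a probability under an i.i.d. model: propagate a probability distribution over the states of a deterministic finite automaton (DFA) for the pattern, one letter at a time. It is not the algorithm of this paper. The function name `match_probability`, the hand-built DFA `DELTA` (strings over A, C, G, T that contain `AA`), and the uniform letter distribution `UNIFORM` are all assumptions chosen for illustration.

```python
from fractions import Fraction

def match_probability(delta, start, accepting, letter_prob, n):
    """Exact probability that an i.i.d. random string of length n
    is accepted by the DFA (delta, start, accepting).

    Illustrative sketch only; all names here are hypothetical.
    """
    # dist[state] = probability of being in `state` after the letters read so far
    dist = {start: Fraction(1)}
    for _ in range(n):
        nxt = {}
        for state, p in dist.items():
            for letter, q in letter_prob.items():
                target = delta[state][letter]
                nxt[target] = nxt.get(target, 0) + p * q
        dist = nxt
    return sum(p for state, p in dist.items() if state in accepting)

# Hypothetical example DFA: accepts strings that contain "AA" somewhere.
# State 0: no progress, state 1: last letter was A, state 2: "AA" seen (absorbing).
ALPHABET = "ACGT"
DELTA = {
    0: {c: (1 if c == "A" else 0) for c in ALPHABET},
    1: {c: (2 if c == "A" else 0) for c in ALPHABET},
    2: {c: 2 for c in ALPHABET},
}
UNIFORM = {c: Fraction(1, 4) for c in ALPHABET}

print(match_probability(DELTA, 0, {2}, UNIFORM, 3))  # 7/64
```

The work per letter is proportional to the number of DFA states times the alphabet size, so the cost grows with the size of the automaton, which can be large for complicated patterns. A first-order Markov chain model could be handled in the same style by tracking (state, last letter) pairs instead of states alone.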
[1] Sampath Kannan, et al. A Quasi-Polynomial-Time Algorithm for Sampling Words from a Context-Free Language. Inf. Comput., 1997.
[2] Alfred V. Aho, et al. The Design and Analysis of Computer Algorithms. 1974.
[3] William J. Stewart, et al. Introduction to the numerical solution of Markov Chains. 1994.
[4] J. Beckmann, et al. Linguistics of nucleotide sequences: morphology and comparison of vocabularies. Journal of biomolecular structure & dynamics, 1986.
[5] Richard Durbin, et al. Method for Calculation of Probability of Matching a Bounded Regular Expression in a Random Data String. J. Comput. Biol., 1995.
[6] Rolf Apweiler, et al. The SWISS-PROT protein sequence data bank and its new supplement TREMBL. Nucleic Acids Res., 1996.
[7] Amos Bairoch, et al. The PROSITE database, its status in 1995. Nucleic Acids Res., 1996.
[8] Derick Wood, et al. Theory of computation. 1986.
[9] Michael Wolfe, et al. J+ = J. ACM SIGPLAN Notices, 1994.
[10] V. Pan, et al. Polynomial and Matrix Computations. Progress in Theoretical Computer Science, 1994.
[11] Jürgen Kleffe, et al. Exact computation of pattern probabilities in random sequences generated by Markov chains. Comput. Appl. Biosci., 1990.
[12] Paul Levi, et al. GENIO/scan - EST Guided Identification of Genes in Human Genomic DNA. German Conference on Bioinformatics, 1998.
[13] 홀덴 데이비드윌리암, et al. Identification of genes. 1995.
[14] D. Searls, et al. Gene structure prediction by linguistic methods. Genomics, 1994.
[15] J. Rissanen, et al. Modeling By Shortest Data Description. Autom., 1978.