We study approximations to the distribution of counts of matches in the best matching segment of specified length when comparing two long sequences of i.i.d. letters. The key tools used are large-deviation inequalities and the Chen-Stein method of Poisson approximation. The origin of the problem in molecular biology is indicated.

1. Introduction.

A strand of DNA can be represented as a long string of letters from the four-letter alphabet {a, c, g, t}. Currently, a large amount of laboratory effort is being expended in the determination and subsequent compilation of genetic information from various organisms. This information consists of listings of these long strings. A natural question arises from biologists' efforts to determine when a comparison of two or more such strings detects an unusual congruence shared among them. Such statistical problems are naturally cast in the usual hypothesis testing context, in which we need to compute the tail probability (the biologists' p-value) for a seemingly unusual event. The work we report here is motivated by the desire to compute the tail probabilities that interest molecular biologists when they evaluate closely matching regions of different biological sequences.

Until recently, the standard tool used in computing tail probabilities was a probabilistic use of the Bonferroni inequalities as pioneered in Watson (1954). Such calculations essentially establish a Poisson approximation for the distribution of counts of weakly dependent rare events. See, for example, the moment calculations in Karlin and Ost (1987) and the discussion in Karlin, Ghandour, Ost, Tavaré and Korn (1983). Use of the Bonferroni inequalities requires computation of moments of arbitrarily large order; the task is always tedious and frequently technically demanding.
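To make the statistic behind these tail probabilities concrete, the following is a minimal sketch, not the authors' method. It assumes that "best matching segment of specified length" means the largest number of position-wise matches between a length-`t` window of one sequence and a length-`t` window of the other, with all shifts allowed, and it assumes uniform i.i.d. letters. The function names `best_window_matches` and `tail_probability` are hypothetical. The tail probability is estimated by plain Monte Carlo simulation, the brute-force baseline that a Poisson approximation is meant to replace.

```python
import random


def best_window_matches(a, b, t):
    """Largest number of matching positions between a length-t window of a
    and a length-t window of b, over all pairs of window positions."""
    n, m = len(a), len(b)
    best = 0
    # d = (start in a) - (start in b) indexes one diagonal of the comparison.
    for d in range(-(m - t), n - t + 1):
        i0 = max(d, 0)          # first usable position in a on this diagonal
        j0 = i0 - d             # corresponding position in b
        length = min(n - i0, m - j0)
        hits = [a[i0 + k] == b[j0 + k] for k in range(length)]
        window = sum(hits[:t])  # matches in the first window on the diagonal
        best = max(best, window)
        for k in range(t, length):
            # Slide the window one step: add the new position, drop the old.
            window += hits[k] - hits[k - t]
            best = max(best, window)
    return best


def tail_probability(n, t, s, trials=200, seed=0, alphabet="acgt"):
    """Monte Carlo estimate of P(best match count >= s) for two independent
    length-n sequences of uniform i.i.d. letters (an assumed null model)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        a = rng.choices(alphabet, k=n)
        b = rng.choices(alphabet, k=n)
        if best_window_matches(a, b, t) >= s:
            exceed += 1
    return exceed / trials
```

For example, `tail_probability(200, 12, 10)` estimates how often two unrelated random sequences of length 200 contain a pair of length-12 windows agreeing in at least 10 positions. Simulation of this kind becomes expensive for the small p-values of practical interest, which is one reason analytic approximations are attractive.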
A promising alternative to using Bonferroni methods to establish the Poisson approximation for dependent events is to use methods developed in [...] the Chen-Stein method of Poisson approximation is generalized to a multivariate context, and various examples relevant to sequence matching are presented. Indeed, the realization that the results of Arratia, Gordon and Waterman (1986) can be obtained without the high-order moment calculations required by Bonferroni methods has enabled us to cope successfully with problems
[1] A. Rényi, et al. On a new law of large numbers. 1970.
[2] J. Komlos, et al. On Sequences of "Pure Heads". 1975.
[3] Louis H. Y. Chen. Poisson Approximation for Dependent Trials. 1975.
[4] Peter Hall, et al. Estimating probabilities for normal extremes. Advances in Applied Probability, 1980.
[5] Andrew Odlyzko, et al. Long repetitive patterns in random sequences. 1980.
[6] A. Barbour. Poisson convergence and random graphs. 1982.
[7] Joseph Naus, et al. Approximations for Distributions of Scan Statistics. 1982.
[8] L. J. Korn, et al. New approaches for computer analysis of nucleic acid sequences. Proceedings of the National Academy of Sciences of the United States of America, 1983.
[9] David Sankoff, et al. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. 1983.
[10] S. Varadhan. Large Deviations and Applications. 1984.
[11] Michael S. Waterman, et al. Critical Phenomena in Sequence Matching. 1985.
[12] Michael S. Waterman, et al. An extreme value theory for long head runs. 1986.
[13] T. Kohchi, et al. Chloroplast gene organization deduced from complete sequence of liverwort Marchantia polymorpha chloroplast DNA. Nature, 1986.
[14] C. Stein. Approximate computation of expectations. 1986.
[15] Luc Devroye, et al. Exact Convergence Rate in the Limit Theorems of Erdos-Renyi and Shepp. 1986.
[16] L. Devroye, et al. Limit laws of Erdös-Rényi-Shepp type. 1987.
[17] Samuel Karlin, et al. Counts of long aligned word matches among random letter sequences. Advances in Applied Probability, 1987.
[18] D. Aldous. Probability Approximations via the Poisson Clumping Heuristic. 1988.
[19] Lars Holst, et al. Some applications of the Stein-Chen method for proving Poisson convergence. Advances in Applied Probability, 1989.