Abstract

Motivated by the comparison of DNA sequences, a generalization is given of the result of Erdős and Rényi on the length $R_n$ of the longest run of heads in the first $n$ tosses of a coin. Consider two sequences, $X_1 X_2 \ldots X_n$ and $Y_1 Y_2 \ldots Y_n$. The length of the longest matching consecutive subsequence, allowing shifts, is $M_n \equiv \max\{m : X_{i+k} = Y_{j+k} \text{ for } k = 1 \text{ to } m, \text{ for some } 0 \le i, j \le n - m\}$. Suppose that all the "letters" are independent and identically distributed. The length of the longest match without shifts has the same distribution as $R_n$, the length of the longest head run for a biased coin with $p = P(X_i = Y_i)$, described by the Erdős-Rényi law: $P\left(\lim_{n \to \infty} \frac{R_n}{\log_{1/p}(n)} = 1\right) = 1$. For matching with shifts, our result is: $P\left(\lim_{n \to \infty} \frac{M_n}{\log_{1/p}(n)} = 2\right) = 1$. Loosely speaking, allowing shifts doubles the length of the longest match. The case of Markov chains is also handled.
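The two limit statements can be explored numerically. The following is a minimal sketch, not taken from the paper: it assumes i.i.d. letters drawn uniformly from a four-letter alphabet (so that p = 1/4), and the function names `longest_run_no_shift`, `longest_match_with_shifts`, and `simulate` are hypothetical. The longest match with shifts is computed with a standard longest-common-substring dynamic program, chosen here for simplicity.

```python
import math
import random


def longest_run_no_shift(x, y):
    """Longest run of consecutive positions k with x[k] == y[k] (no shifts)."""
    best = cur = 0
    for a, b in zip(x, y):
        cur = cur + 1 if a == b else 0
        best = max(best, cur)
    return best


def longest_match_with_shifts(x, y):
    """Longest common contiguous block of x and y (arbitrary offsets i, j).

    Standard dynamic program: O(len(x) * len(y)) time, O(len(y)) memory.
    """
    prev = [0] * (len(y) + 1)
    best = 0
    for a in x:
        cur = [0] * (len(y) + 1)
        for j, b in enumerate(y, start=1):
            if a == b:
                # Extend the match that ended one step earlier in both sequences.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best:
                    best = cur[j]
        prev = cur
    return best


def simulate(n, alphabet="ACGT", seed=0):
    """Return (R_n / log_{1/p} n, M_n / log_{1/p} n) for one random pair.

    Assumption: letters are uniform on `alphabet`, so p = 1 / len(alphabet).
    """
    rng = random.Random(seed)
    x = [rng.choice(alphabet) for _ in range(n)]
    y = [rng.choice(alphabet) for _ in range(n)]
    p = 1.0 / len(alphabet)
    scale = math.log(n) / math.log(1.0 / p)  # log base 1/p of n
    return (longest_run_no_shift(x, y) / scale,
            longest_match_with_shifts(x, y) / scale)


if __name__ == "__main__":
    for n in (100, 400, 1000):
        r_ratio, m_ratio = simulate(n)
        print(n, round(r_ratio, 2), round(m_ratio, 2))
```

Because the normalization grows only logarithmically in n, the ratios at these sizes should be expected to sit only roughly near 1 and 2, and they fluctuate from seed to seed; the limits in the abstract are almost-sure statements as n tends to infinity.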
[1] Kai Lai Chung, et al. Markov Chains with Stationary Transition Probabilities. 1961.
[2] Temple F. Smith, et al. The statistical distribution of nucleic acid similarities. Nucleic Acids Research, 1985.
[3] Andrew Odlyzko, et al. Long repetitive patterns in random sequences. 1980.
[4] V. Chvátal, et al. Longest common subsequences of two random sequences. Advances in Applied Probability, 1975.
[5] L. J. Korn, et al. New approaches for computer analysis of nucleic acid sequences. Proceedings of the National Academy of Sciences of the United States of America, 1983.
[6] M. S. Waterman, et al. Identification of common molecular subsequences. Journal of Molecular Biology, 1981.
[7] A. Rényi, et al. On a new law of large numbers. 1970.