An Erdős-Rényi law with shifts

Abstract: Motivated by the comparison of DNA sequences, a generalization is given of the result of Erdős and Rényi on the length $R_n$ of the longest run of heads in the first $n$ tosses of a coin. Consider two sequences, $X_1 X_2 \ldots X_n$ and $Y_1 Y_2 \ldots Y_n$. The length of the longest matching consecutive subsequence, allowing shifts, is $M_n \equiv \max\{m : X_{i+k} = Y_{j+k} \text{ for } k = 1 \text{ to } m, \text{ for some } 0 \le i, j \le n - m\}$. Suppose that all the "letters" are independent and identically distributed. The length of the longest match without shifts has the same distribution as $R_n$, the length of the longest head run for a biased coin with $p = P(X_i = Y_i)$, described by the Erdős-Rényi law: $P\left(\lim_{n \to \infty} \frac{R_n}{\log_{1/p}(n)} = 1\right) = 1$. For matching with shifts, our result is: $P\left(\lim_{n \to \infty} \frac{M_n}{\log_{1/p}(n)} = 2\right) = 1$. Loosely speaking, allowing shifts doubles the length of the longest match. The case of Markov chains is also handled.
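The two quantities in the abstract can be compared numerically. The following is a minimal simulation sketch, not the authors' method: it assumes letters drawn uniformly from a four-letter alphabet (so that $p = 1/4$), and the function names `longest_match_no_shift`, `longest_match_with_shifts`, and `simulate` are hypothetical names chosen for illustration. The shifted match length is computed with the standard dynamic program for the longest common contiguous block.

```python
import math
import random


def longest_match_no_shift(x, y):
    """Longest run of positions i with x[i] == y[i] (plays the role of R_n)."""
    best = run = 0
    for a, b in zip(x, y):
        run = run + 1 if a == b else 0
        best = max(best, run)
    return best


def longest_match_with_shifts(x, y):
    """Longest common contiguous block of x and y (plays the role of M_n).

    cur[j] holds the length of the longest match ending at the current
    letter of x and at y[j - 1]; runtime is O(len(x) * len(y)).
    """
    best = 0
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0] * (len(y) + 1)
        for j, b in enumerate(y, start=1):
            if a == b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def simulate(n, alphabet="ACGT", seed=0):
    """Return (R_n / log_{1/p} n, M_n / log_{1/p} n) for one random pair.

    Assumption: letters are i.i.d. uniform on `alphabet`, so
    p = P(X_i = Y_i) = 1 / len(alphabet).
    """
    rng = random.Random(seed)
    x = [rng.choice(alphabet) for _ in range(n)]
    y = [rng.choice(alphabet) for _ in range(n)]
    p = 1.0 / len(alphabet)
    scale = math.log(n) / math.log(1.0 / p)  # log base 1/p of n
    return (
        longest_match_no_shift(x, y) / scale,
        longest_match_with_shifts(x, y) / scale,
    )
```

Calling `simulate(n)` for increasing `n` should give ratios that drift toward 1 and 2 respectively. The limits are almost-sure statements on a logarithmic scale, so values at moderate `n` can still sit noticeably away from them.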
