An Extreme Value Theory for Sequence Matching

of matches between the X's and Y's, allowing at most k mismatches. The distribution is closely approximated by that of the maximum of $(1-p)mn$ i.i.d. negative binomial random variables. The latter distribution is in turn shown to behave like the integer part of an extreme value distribution. The expectation is approximately $\log(qmn) + k \log\log(qmn) + k \log(q/p) - \log(k!) + \gamma \log(e) - 1/2$, where $q = 1 - p$, log denotes logarithm base $1/p$, and $\gamma$ is the Euler constant. The variance is approximated by $(\pi \log(e))^2/6 + 1/12$. The paper concludes with an example in which we compare segments taken from the DNA sequence of the bacteriophage lambda.

0. Introduction. DNA sequences can be represented as finite sequences over the four-letter alphabet {A, C, G, T}. Such a sequence corresponds to successive appearances of the nucleotides adenine (A), cytosine (C), guanine (G), and thymine (T). One of the impressive accomplishments of molecular biology is the facility with which the sequences corresponding to actual genetic material are determined. Much effort is currently invested in determining the DNA sequences belonging to the chromosomes of various organisms. See for example the book Nucleotide Sequences 1984 [Anderson et al. (1984)], which is an atlas of such representations. By mid-1985, DNA sequences with a total length of approximately $5 \times 10^6$ were known, and sequencing was proceeding at an approximate rate of $10^6$ letters per year.

Sequences belonging to seemingly unrelated organisms have been found to possess long contiguous subsequences which are practically identical. Doolittle et al. (1983) report an unexpected relationship of this kind between viral DNA and host DNA. The identification and interpretation of such shared contiguous subsequences are of substantial interest to biologists. See Waterman (1984) for a review of these methods.
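As a rough numerical illustration of the approximations quoted in the abstract, the following Python sketch evaluates the approximate expectation and variance, with the constant terms read as 1/2 and 1/12, and computes by brute force the longest contiguous match between two sequences allowing at most k mismatches. The function names `approx_mean_var` and `longest_match` are hypothetical, and counting the mismatched positions as part of the run length is an assumption of this sketch, not a statement of the paper's procedure.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler constant


def approx_mean_var(m, n, p, k=0):
    """Approximate mean and variance of the longest match length.

    m, n: sequence lengths; p: probability that two letters match;
    k: number of mismatches allowed. Logarithms are taken base 1/p.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    q = 1.0 - p
    qmn = q * m * n
    if qmn <= 1.0:
        raise ValueError("q*m*n must exceed 1 so that log log(qmn) is defined")

    def log(x):
        # logarithm base 1/p
        return math.log(x) / math.log(1.0 / p)

    mean = (log(qmn) + k * log(log(qmn)) + k * log(q / p)
            - log(math.factorial(k)) + EULER_GAMMA * log(math.e) - 0.5)
    var = (math.pi * log(math.e)) ** 2 / 6.0 + 1.0 / 12.0
    return mean, var


def longest_match(x, y, k=0):
    """Length of the longest aligned contiguous segment of x and y
    containing at most k mismatches (mismatched positions are counted
    in the length; this convention is an assumption of the sketch)."""
    m, n = len(x), len(y)
    best = 0
    # Each diagonal d = i - j is one relative alignment of x against y.
    for d in range(-(n - 1), m):
        i0 = max(d, 0)
        j0 = i0 - d
        length = min(m - i0, n - j0)
        start = 0
        mismatches = 0
        # Sliding window along the diagonal holding at most k mismatches.
        for end in range(length):
            if x[i0 + end] != y[j0 + end]:
                mismatches += 1
            while mismatches > k:
                if x[i0 + start] != y[j0 + start]:
                    mismatches -= 1
                start += 1
            best = max(best, end - start + 1)
    return best


if __name__ == "__main__":
    mean, var = approx_mean_var(1000, 1000, 0.25, k=0)
    print(f"approximate mean {mean:.3f}, approximate variance {var:.3f}")
    print(longest_match("ACGT", "ACTT", k=0), longest_match("ACGT", "ACTT", k=1))
```

For uniformly distributed nucleotides one would take p = 1/4; comparing `longest_match` on simulated sequences against `approx_mean_var` gives an informal check of the approximation, not a substitute for the paper's analysis.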
These aspects of matching between sequences lead us to ask the following mathematical question: for two independently generated random sequences, what is the distribution of the length of the longest run of contiguous matches? Evolution of nucleotide sequences proceeds by substitution, insertion, and deletion of nucleotides. Substitutions motivate us to study the distribution of the length of the longest contiguous run of matches allowing for a fixed number k of