Exact Distribution of a Spaced Seed Statistic for DNA Homology Detection

Let a seed, S, be a string from the alphabet {1,*}, of arbitrary length k, which starts and ends with a 1. For example, S = 11*1. S occurs in a binary string T at position h if the length-k substring of T ending at position h contains a 1 in every position where there is a 1 in S. We say that the 1s at the corresponding positions in T are covered. We are interested in calculating the probability distribution for the number of 1s covered by a seed S in an iid Bernoulli string of length n with probability of 1 equal to p. We refer to this new probability distribution as C_{nSp}, for covered, with S being the seed. We present an efficient method to calculate this distribution exactly. Covered 1s represent matching positions detected in DNA sequences when using multiple hits of a spaced seed. Knowledge of the distribution provides a statistical threshold for distinguishing true homologies from randomly matching sequences.
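
To make the definition of covered 1s concrete, the following Python sketch computes the distribution by brute-force enumeration of all 2^n binary strings. It is only an illustration of the definition and is not the efficient method referred to above; it is practical only for small n. The function names covered_count and covered_distribution are hypothetical, chosen for this example.

```python
from itertools import product


def covered_count(seed, text):
    """Count the 1s in `text` covered by at least one occurrence of `seed`.

    `seed` is a string over {'1', '*'}; `text` is a sequence of '0'/'1'.
    A 1 in `text` is covered if it aligns with a 1 of the seed in some
    occurrence; positions aligned with '*' are not covered by that occurrence.
    """
    k = len(seed)
    seed_ones = [i for i, c in enumerate(seed) if c == "1"]
    covered = set()
    for start in range(len(text) - k + 1):
        # The seed occurs here if every seed 1 aligns with a 1 in the text.
        if all(text[start + i] == "1" for i in seed_ones):
            covered.update(start + i for i in seed_ones)
    return len(covered)


def covered_distribution(seed, n, p):
    """Brute-force distribution of the covered count (illustration only).

    Enumerates all 2**n strings, so the cost is exponential in n.
    Returns a list `dist` where dist[c] = P(exactly c covered 1s).
    """
    dist = [0.0] * (n + 1)
    for bits in product("01", repeat=n):
        ones = bits.count("1")
        prob = p ** ones * (1 - p) ** (n - ones)
        dist[covered_count(seed, bits)] += prob
    return dist


if __name__ == "__main__":
    print(covered_count("11*1", "110111"))
    print(covered_distribution("11*1", 4, 0.5))
```

For S = 11*1, n = 4 and p = 0.5, only the strings 1101 and 1111 contain an occurrence of the seed, and each has exactly 3 covered 1s, so the sketch assigns probability 0.125 to 3 covered 1s and 0.875 to 0.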