A lower-variance randomized algorithm for approximate string matching

Several randomized algorithms use convolution to estimate the score vector of matches between a text string of length N and a pattern string of length M, i.e., the vector obtained by sliding the pattern along the text and counting the number of matches at each position. These algorithms run in deterministic time O(kN log M) and produce an unbiased estimator of the scores whose variance is (M-c)^2/k, where c is the actual score; here k is an adjustable parameter that trades computation time for lower variance. This paper provides an algorithm that also runs in deterministic time O(kN log M) but achieves a lower variance of min(M/k, M-c)(M-c)/k. For all score values c that are less than M-(M/k), our variance is essentially a factor of k smaller than in previous work, and for M-(M/k)
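
To make the two variance expressions concrete, the following minimal sketch evaluates them side by side and also computes the exact score vector by brute force, so the quantity being estimated is unambiguous. It is an illustration only, not the randomized algorithm itself: the function names `score_vector`, `previous_variance`, and `improved_variance` are hypothetical, and the brute-force scan takes O(NM) time rather than O(kN log M).

```python
from fractions import Fraction


def score_vector(text, pattern):
    """Exact score vector: number of matching positions at each alignment.

    Brute force, O(N*M); shown only to define the quantity being estimated.
    """
    n, m = len(text), len(pattern)
    return [
        sum(text[i + j] == pattern[j] for j in range(m))
        for i in range(n - m + 1)
    ]


def previous_variance(M, c, k):
    """Variance (M-c)^2/k stated for the earlier estimators."""
    return Fraction((M - c) ** 2, k)


def improved_variance(M, c, k):
    """Variance min(M/k, M-c)(M-c)/k stated for the new estimator."""
    return min(Fraction(M, k), M - c) * (M - c) / k


if __name__ == "__main__":
    M, k = 100, 10  # hypothetical pattern length and trade-off parameter
    for c in (0, 50, 95):
        print(c, previous_variance(M, c, k), improved_variance(M, c, k))
```

With these assumed values, a score of c = 0 gives 1000 versus 100, a full factor of k. Once M-c drops to M/k or below, the minimum in the new expression equals M-c, so the two formulas coincide.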