On The Collapse of the q-Gram Filtration

In the approximate pattern matching problem, the text area to be searched for an occurrence of a pattern can be pruned by applying a filtration condition. A q-gram based filtration condition defines potential text areas in terms of pattern q-grams, i.e., substrings of length q. A text area is checked by an accurate method only if the set of q-grams in that area satisfies a certain condition. One hopes that the filtration keeps the number of checks to a minimum, thus making the algorithm efficient. However, computer experiments show that the filtration method works well when the allowed error level k is small relative to the pattern length, but loses its efficiency quite sharply as k increases. This is a phase transition phenomenon of a kind that is often observed in nature. In this paper, we present a theoretical explanation for this phenomenon, which gives us occasion to introduce advanced mathematical analysis based on certain languages, correlation polynomials, generating functions and complex analysis. It is our view that nothing can be more exciting and rewarding than finding a theoretical justification for an abrupt manifestation of nature.
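To make the idea of a q-gram filtration condition concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the condition analyzed in the paper: it uses the well-known q-gram counting bound (a window within k substitutions of a pattern of length m shares at least m - q + 1 - kq q-grams with it), it considers only substitution errors, and it scans fixed-length windows. The names `qgrams`, `qgram_threshold`, and `candidate_windows` are hypothetical.

```python
from collections import Counter


def qgrams(s, q):
    """Return all substrings of length q of s, in order of occurrence."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_threshold(m, q, k):
    """Minimum number of shared q-grams under the counting bound.

    Assumption: each of the k substitutions destroys at most q of the
    m - q + 1 q-grams of the pattern.
    """
    return m - q + 1 - k * q


def candidate_windows(text, pattern, q, k):
    """Start positions of length-m windows that pass the filter.

    Only these windows would be handed to an exact verification step.
    """
    m = len(pattern)
    n_windows = max(len(text) - m + 1, 0)
    threshold = qgram_threshold(m, q, k)
    if threshold <= 0:
        # The condition is vacuous: nothing is pruned.
        return list(range(n_windows))
    pattern_counts = Counter(qgrams(pattern, q))
    hits = []
    for start in range(n_windows):
        window_counts = Counter(qgrams(text[start:start + m], q))
        # Size of the multiset intersection of the two q-gram collections.
        shared = sum(min(c, window_counts[g]) for g, c in pattern_counts.items())
        if shared >= threshold:
            hits.append(start)
    return hits


if __name__ == "__main__":
    text = "xxxxabcdefghxxxx"
    print(candidate_windows(text, "abcdefgh", q=2, k=1))
    print(candidate_windows(text, "abcdefgh", q=2, k=4))
```

In this simplified setting the threshold drops to zero or below once k reaches (m - q + 1)/q, at which point every window is a candidate and the filter prunes nothing. That gives a rough deterministic picture of why efficiency degrades as k grows; the sharp transition studied in the paper concerns the behavior on random texts and calls for the probabilistic analysis the abstract mentions.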
