Fast pattern matching for entropy bounded text

We present the first known one-dimensional and two-dimensional string matching algorithms for text with bounded entropy. Let $n$ be the length of the text and $m$ be the length of the pattern. We show that the expected complexity of the algorithms is related to the entropy of the text under various assumptions about the distribution of the pattern. For uniformly distributed patterns, our one-dimensional matching algorithm runs in $O(n \log m / (pm))$ expected time, where $H$ is the entropy of the text and $p = 1 - (1 - H^2)^{H/(1+H)}$. The worst-case running time $T$ can also be bounded by $n \log m / (p(m + \sqrt{V})) \le T \le n \log m / (p(m - \sqrt{V}))$, where $V$ is the variance of the source from which the pattern is generated. Our algorithm uses data structures and probabilistic analysis techniques found in certain lossless data compression schemes.
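To give a feel for how the stated bounds behave, the following is a minimal numerical sketch, not the authors' algorithm. It assumes that the entropy $H$ is normalized to lie in $[0, 1]$, that logarithms are base 2, and that the constant hidden in the $O(\cdot)$ is 1; the function names are hypothetical and chosen only for illustration.

```python
import math


def match_probability(H: float) -> float:
    """Evaluate p = 1 - (1 - H^2)^(H / (1 + H)).

    Assumption: H is normalized so that 0 <= H <= 1.
    """
    if not 0.0 <= H <= 1.0:
        raise ValueError("H is assumed to lie in [0, 1]")
    return 1.0 - (1.0 - H * H) ** (H / (1.0 + H))


def expected_cost(n: int, m: int, H: float) -> float:
    """Evaluate n log m / (p m), dropping the big-O constant.

    Assumption: log is base 2; the abstract does not state the base.
    """
    p = match_probability(H)
    if p == 0.0:
        raise ValueError("p is zero (H = 0), so the expression is undefined")
    return n * math.log2(m) / (p * m)


def worst_case_bounds(n: int, m: int, H: float, V: float) -> tuple[float, float]:
    """Return (lower, upper) for the running time T from the stated inequality.

    Assumption: sqrt(V) < m, so that the upper bound is finite and positive.
    """
    p = match_probability(H)
    root_v = math.sqrt(V)
    if p == 0.0 or root_v >= m:
        raise ValueError("bounds need p > 0 and sqrt(V) < m")
    work = n * math.log2(m)
    return work / (p * (m + root_v)), work / (p * (m - root_v))


if __name__ == "__main__":
    for H in (0.25, 0.5, 0.75, 1.0):
        print(f"H={H:.2f}  p={match_probability(H):.4f}  "
              f"cost={expected_cost(1 << 20, 64, H):.1f}")
```

Under these assumptions, $p$ grows from 0 at $H = 0$ to 1 at $H = 1$, so the expression $n \log m / (pm)$ shrinks as the entropy of the text increases.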
