Speeding Up Pattern Matching by Text Compression

Byte pair encoding (BPE) is a simple universal text compression scheme. Decompression is very fast and requires little working space, and an arbitrary part of the original text is easy to decompress. However, BPE has not been widely used, because compression is rather slow and the compression ratio is worse than that of other methods such as Lempel-Ziv type compression. In this paper, we bring out a potential advantage of BPE compression. We show that it is well suited, from a practical viewpoint, to compressed pattern matching, where the goal is to find a pattern directly in compressed text without decompressing it explicitly. We compare the running times needed to find a pattern in (1) BPE compressed files, (2) Lempel-Ziv-Welch compressed files, and (3) original text files, in various situations. Experimental results show that pattern matching in BPE compressed text is even faster than matching in the original text. Thus BPE compression reduces not only disk space but also search time.
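
To make the scheme concrete, the following is a minimal Python sketch of byte pair encoding in its commonly described form: the most frequent pair of adjacent bytes is repeatedly replaced by a byte value that does not occur in the text, and each replacement is recorded as a rule. The function names (`bpe_compress`, `bpe_expand`, `bpe_decompress`) and the in-memory rule table are assumptions made for illustration; this is not the implementation evaluated in the paper, and it does not show the compressed pattern matching algorithm itself.

```python
from collections import Counter


def bpe_compress(data: bytes):
    """Illustrative BPE: replace the most frequent adjacent pair with an unused byte."""
    seq = list(data)
    present = set(seq)
    # Byte values absent from the input can serve as new symbols.
    unused = [b for b in range(256) if b not in present]
    rules = {}  # new symbol -> (left symbol, right symbol)
    while unused:
        # Overlapping pairs (e.g. in "aaa") are counted naively here;
        # this may overestimate the gain but keeps the round trip correct.
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so nothing more to gain
        sym = unused.pop()
        rules[sym] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return bytes(seq), rules


def bpe_expand(sym: int, rules) -> bytes:
    """Expand one compressed symbol back into the original bytes it stands for."""
    if sym not in rules:
        return bytes([sym])
    left, right = rules[sym]
    return bpe_expand(left, rules) + bpe_expand(right, rules)


def bpe_decompress(comp: bytes, rules) -> bytes:
    """Decompress any slice of the compressed text, symbol by symbol."""
    return b"".join(bpe_expand(s, rules) for s in comp)


if __name__ == "__main__":
    text = b"abababab abab"
    comp, rules = bpe_compress(text)
    print(len(text), len(comp))
    print(bpe_decompress(comp, rules) == text)
```

Because every compressed symbol expands independently of its neighbours, any slice of the compressed byte sequence can be passed to `bpe_decompress` on its own, which reflects the property that an arbitrary part of the text is easy to decompress.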
