Compressed pattern matching for SEQUITUR

SEQUITUR, due to Nevill-Manning and Witten (see Journal of Artificial Intelligence Research, vol. 7, pp. 67-82, 1997), is a powerful program that infers a phrase hierarchy from the input text and also provides extremely effective compression of large quantities of semi-structured text. In this paper, we address the problem of searching SEQUITUR-compressed text directly. We present a compressed pattern matching algorithm that finds a pattern in compressed text without explicit decompression. We show that our algorithm is approximately 1.27 times faster than decompression followed by an ordinary search.
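To illustrate what searching without explicit decompression can look like, here is a minimal sketch. It is not the algorithm of this paper. It shows a generic technique: run a Knuth-Morris-Pratt automaton over the grammar rules and memoize, for each (symbol, automaton state) pair, the resulting state and the number of matches. The grammar representation (a dict from rule name to a list of symbols), the function names, and the example grammar are all assumptions made for illustration. The pattern is assumed to be non-empty, and terminal characters are assumed not to collide with rule names.

```python
from functools import lru_cache


def build_kmp_step(pattern):
    """Return step(state, ch) -> next state of the KMP automaton for pattern."""
    m = len(pattern)
    fail = [0] * (m + 1)  # fail[i] = longest proper border of pattern[:i]
    k = 0
    for i in range(1, m):
        while k and pattern[i] != pattern[k]:
            k = fail[k]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i + 1] = k

    def step(state, ch):
        # After a full match (state == m), fall back before consuming ch.
        while state == m or (state and pattern[state] != ch):
            state = fail[state]
        if pattern[state] == ch:
            state += 1
        return state

    return step


def count_matches(rules, start, pattern):
    """Count (possibly overlapping) occurrences of pattern in the text
    generated by `start`, without expanding the grammar."""
    step = build_kmp_step(pattern)
    m = len(pattern)

    @lru_cache(maxsize=None)
    def run(symbol, state):
        # Terminal: a single automaton transition.
        if symbol not in rules:
            nxt = step(state, symbol)
            return nxt, int(nxt == m)
        # Nonterminal: thread the state through the right-hand side.
        total = 0
        for child in rules[symbol]:
            state, hits = run(child, state)
            total += hits
        return state, total

    return run(start, 0)[1]


def expand(rules, symbol):
    """Decompress a symbol to plain text (used here only as a cross-check)."""
    if symbol not in rules:
        return symbol
    return "".join(expand(rules, child) for child in rules[symbol])


# Hypothetical grammar of the kind SEQUITUR infers for "abcabcabcabc".
rules = {"S": ["B", "B"], "B": ["A", "A"], "A": ["a", "b", "c"]}
print(count_matches(rules, "S", "cab"))
```

Because each rule is processed at most once per automaton state, repeated phrases in the text are scanned once rather than once per occurrence, which is the source of the potential speed-up over decompress-then-search.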

[1]  Ian H. Witten,et al.  Linear-time, incremental hierarchy inference for compression , 1997, Proceedings DCC '97. Data Compression Conference.

[2]  Craig G. Nevill-Manning,et al.  Compression and Explanation Using Hierarchical Grammars , 1997, Comput. J..

[3]  Robert S. Boyer,et al.  A fast string searching algorithm , 1977, CACM.

[4]  Ayumi Shinohara,et al.  A Boyer-Moore Type Algorithm for Compressed Pattern Matching , 2000, CPM.

[5]  Udi Manber.  A text compression scheme that allows fast searching directly in the compressed file , 1997, TOIS.

[6]  Ayumi Shinohara,et al.  Bit-parallel approach to approximate string matching in compressed texts , 2000, Proceedings Seventh International Symposium on String Processing and Information Retrieval. SPIRE 2000.

[7]  Philip Gage,et al.  A new algorithm for data compression , 1994 .

[8]  Ayumi Shinohara,et al.  Shift-And Approach to Pattern Matching in LZW Compressed Text , 1999, CPM.

[9]  Gary Benson,et al.  Let sleeping files lie: pattern matching in Z-compressed files , 1994, SODA '94.

[10]  Ian H. Witten,et al.  Identifying Hierarchical Structure in Sequences: A linear-time algorithm , 1997, J. Artif. Intell. Res..

[11]  Terry A. Welch,et al.  A Technique for High-Performance Data Compression , 1984, Computer.

[12]  Alistair Moffat,et al.  Off-line dictionary-based compression , 1999, Proceedings of the IEEE.

[13]  Ayumi Shinohara,et al.  Multiple pattern matching in LZW compressed text , 1998, Proceedings DCC '98 Data Compression Conference (Cat. No.98TB100225).

[14]  Gonzalo Navarro,et al.  Approximate String Matching over Ziv-Lempel Compressed Text , 2000, CPM.

[15]  Ayumi Shinohara,et al.  A unifying framework for compressed pattern matching , 1999, 6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268).

[16]  Gonzalo Navarro,et al.  A General Practical Approach to Pattern Matching over Ziv-Lempel Compressed Text , 1999, CPM.

[17]  En-Hui Yang,et al.  Grammar-based codes: A new class of universal lossless source codes , 2000, IEEE Trans. Inf. Theory.

[18]  Dan R. Olsen,et al.  Compressing semi-structured text using hierarchical phrase identifications , 1996, Proceedings of Data Compression Conference - DCC '96.

[19]  Donald E. Knuth,et al.  Fast Pattern Matching in Strings , 1977, SIAM J. Comput..

[20]  Ian H. Witten,et al.  Phrase hierarchy inference and compression in bounded space , 1998, Proceedings DCC '98 Data Compression Conference (Cat. No.98TB100225).

[21]  Ayumi Shinohara,et al.  Multiple Pattern Matching Algorithms on Collage System , 2001, CPM.

[22]  Setsuo Arikawa,et al.  Faster approximate string matching over compressed text , 2001, Proceedings DCC 2001. Data Compression Conference.

[23]  Ayumi Shinohara,et al.  Speeding Up Pattern Matching by Text Compression , 2000, CIAC.