Context-Sensitive Grammar Transform: Compression and Pattern Matching

A framework for context-sensitive grammar transforms is proposed. A greedy compression algorithm based on this transform model is presented, together with a Knuth-Morris-Pratt (KMP)-type compressed pattern matching (CPM) algorithm. The compression performance is comparable to that of gzip and Re-Pair. The search speed of our CPM algorithm is almost twice that of the KMP-type CPM algorithm on Byte-Pair-Encoding by Shibata et al. (2000) and, for short patterns, exceeds that of the Boyer-Moore-Horspool algorithm with the stopper encoding by Rautio et al. (2002), which is regarded as one of the best combinations for practically fast search.
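
For orientation, the classical KMP matcher on uncompressed text, which KMP-type CPM algorithms take as their starting point, can be sketched as follows. This is a minimal illustration only, not the CPM algorithm of this paper: it scans plain text rather than a compressed representation, and the function names `build_failure` and `kmp_search` are hypothetical.

```python
def build_failure(pattern):
    """Return the KMP failure table.

    fail[i] is the length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it.
    """
    fail = [0] * len(pattern)
    k = 0  # length of the border currently being extended
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]  # fall back to the next shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the start indices of all occurrences of pattern in text."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping occurrences
    return matches
```

Each text character is read once and the matcher never moves backwards in the text, which is the property that makes this style of automaton attractive for scanning a compressed stream symbol by symbol.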

[1] Abhi Shelat, et al. The smallest grammar problem, 2005, IEEE Transactions on Information Theory.

[2] Gonzalo Navarro, et al. Practical and flexible pattern matching over Ziv-Lempel compressed text, 2004, J. Discrete Algorithms.

[3] Setsuo Arikawa, et al. A run-time efficient realization of Aho-Corasick pattern matching machines, 2009, New Generation Computing.

[4] Ayumi Shinohara, et al. A Boyer-Moore Type Algorithm for Compressed Pattern Matching, 2000, CPM.

[5] Gary Benson, et al. Let sleeping files lie: pattern matching in Z-compressed files, 1994, SODA '94.

[6] Masayuki Takeda, et al. A Run-Time Efficient Implementation of Compressed Pattern Matching Automata, 2008, CIAA.

[7] M. Lothaire, et al. Applied Combinatorics on Words, 2005.

[8] Hiroshi Sakamoto, et al. A fully linear-time approximation algorithm for grammar-based compression, 2003, J. Discrete Algorithms.

[9] Dake He, et al. Efficient universal lossless data compression algorithms based on a greedy sequential grammar transform. 2. With context models, 2000, IEEE Trans. Inf. Theory.

[10] Ayumi Shinohara, et al. Speeding Up Pattern Matching by Text Compression, 2000, CIAC.

[11] Maxime Crochemore, et al. Algorithms on strings, 2007.

[12] Hiroshi Sakamoto, et al. A Space-Saving Linear-Time Algorithm for Grammar-Based Compression, 2004, SPIRE.

[13] Gary Benson, et al. Efficient two-dimensional compressed matching, 1992, Data Compression Conference, 1992.

[14] Donald Ervin Knuth, et al. The Art of Computer Programming, 1968.

[15] Wojciech Rytter, et al. Application of Lempel-Ziv factorization to the approximation of grammar-based compression, 2002, Theor. Comput. Sci.

[16] Alistair Moffat, et al. Off-line dictionary-based compression, 1999, Proceedings of the IEEE.

[17] G J Barker, et al. Diffusion imaging shows abnormalities after blunt head trauma when conventional magnetic resonance imaging is normal, 2001, Journal of neurology, neurosurgery, and psychiatry.

[18] Ayumi Shinohara, et al. Collage system: a unifying framework for compressed pattern matching, 2003, Theor. Comput. Sci.

[19] En-Hui Yang, et al. Grammar-based codes: A new class of universal lossless source codes, 2000, IEEE Trans. Inf. Theory.

[20] Udi Manber, et al. A text compression scheme that allows fast searching directly in the compressed file, 1994, TOIS.

[21] Takuya Kida, et al. A Space-Saving Approximation Algorithm for Grammar-Based Compression, 2009, IEICE Trans. Inf. Syst.

[22] M. Crochemore, et al. Algorithms on Strings: Tools, 2007.

[23] Jorma Tarhio, et al. String Matching with Stopper Encoding and Code Splitting, 2002, CPM.