Efficient Regular Expression Matching on Compressed Strings

Existing methods for regular expression matching on LZ78 compressed strings do not perform efficiently. Moreover, LZ78 compression has some shortcomings compared with LZ77, a related Lempel-Ziv scheme: a worse compression ratio and slower decompression. In this paper, we study regular expression matching on LZ77 compressed strings. To address this problem, we propose an efficient algorithm, RELZ, which uses the positive factors of the regular expression (a prefix and a suffix) together with its negative factors (substrings that cannot appear in any answer) to prune candidates. To locate these two kinds of factors quickly on the compressed string without decompressing it, we design a variant of the suffix trie index, called SSLZ. In addition, we construct bitmaps for the factors of the regular expression to detect potential regions, and we propose block filtering to further reduce the number of candidates. Finally, we conduct a comprehensive performance evaluation on five real datasets to validate our ideas and the proposed algorithms. The experimental results show that RELZ significantly outperforms the existing algorithms.
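
The pruning idea described above can be illustrated with a minimal sketch. It is not the RELZ algorithm: it works on an uncompressed string, uses plain substring search in place of the SSLZ index, and omits the bitmaps and block filtering. The function names (find_all, candidate_spans, match_with_factors) are hypothetical, and the prefix, suffix, and negative factors are assumed to be supplied by hand rather than derived from the regular expression.

```python
import re


def find_all(text, factor):
    """Return every start offset of `factor` in `text`, overlaps included."""
    starts, pos = [], text.find(factor)
    while pos != -1:
        starts.append(pos)
        pos = text.find(factor, pos + 1)
    return starts


def candidate_spans(text, prefix, suffix, negative_factors):
    """Pair prefix starts with suffix ends, skipping spans that would
    contain an occurrence of a negative factor.

    Assumption for this sketch: prefix and suffix do not overlap inside
    a match, so a span must be at least len(prefix) + len(suffix) long.
    """
    starts = find_all(text, prefix)
    ends = [p + len(suffix) for p in find_all(text, suffix)]
    # Each negative-factor occurrence as a half-open interval [s, e).
    neg = [(s, s + len(f)) for f in negative_factors for s in find_all(text, f)]
    min_len = len(prefix) + len(suffix)

    spans = []
    for i in starts:
        # A span [i, j) contains an occurrence [s, e) iff s >= i and e <= j,
        # so only ends j below the smallest such e can survive.
        limit = min((e for s, e in neg if s >= i), default=len(text) + 1)
        for j in ends:
            if j - i >= min_len and j < limit:
                spans.append((i, j))
    return spans


def match_with_factors(text, pattern, prefix, suffix, negative_factors):
    """Verify the surviving candidates with a full regular expression match."""
    compiled = re.compile(pattern)
    return [
        (i, j)
        for i, j in candidate_spans(text, prefix, suffix, negative_factors)
        if compiled.fullmatch(text, i, j)
    ]


if __name__ == "__main__":
    text = "xxabcdefxxabeeabef"
    # For ab[cd]*ef, every match starts with "ab" and ends with "ef";
    # "x" and "ee" can never occur inside a match.
    print(match_with_factors(text, r"ab[cd]*ef", "ab", "ef", ["x", "ee"]))
```

In this example, pairing every prefix occurrence with every later suffix occurrence would give four candidate spans; the negative factors discard two of them before the regular expression is evaluated at all.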
