Paper information - Speeding up pattern matching by optimal partial string extraction

String matching plays a key role in web content monitoring systems. Suffix matching algorithms have good time efficiency and are therefore widely used. These algorithms require that all patterns in a set have the same length. When the patterns do not satisfy this requirement, the leftmost m characters of each pattern, where m is the length of the shortest pattern, are extracted to construct the data structure. We call such m-character strings partial strings. However, simple extraction from the left ignores the impact that the location of a partial string has on search speed. We propose a novel method for extracting the partial string from each pattern so as to maximize search speed. More specifically, the method derives the search time cost of every candidate location theoretically and chooses the location that yields an approximately minimal search time. We evaluate our method on two rule sets: Snort and ClamAV. Experiments show that in most cases our method achieves the fastest search speed among all possible locations of partial string extraction, and is about 5%–20% faster than the alternative methods.
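The idea of extracting partial strings can be illustrated with a minimal Python sketch. It shows the leftmost-extraction baseline, the enumeration of every candidate m-character window of a pattern, and a per-pattern selection step. The function names and the `toy_cost` function are assumptions made for illustration only; the abstract does not give the paper's search-time cost model, so `toy_cost` is a placeholder and not the authors' derivation.

```python
from typing import Callable, List, Sequence


def shortest_length(patterns: Sequence[str]) -> int:
    """Return m, the length of the shortest pattern in the set."""
    if not patterns:
        raise ValueError("pattern set must not be empty")
    return min(len(p) for p in patterns)


def leftmost_partial_strings(patterns: Sequence[str]) -> List[str]:
    """Baseline: take the leftmost m characters of every pattern."""
    m = shortest_length(patterns)
    return [p[:m] for p in patterns]


def candidate_partial_strings(pattern: str, m: int) -> List[str]:
    """List every m-character substring of one pattern, ordered by start offset."""
    return [pattern[i:i + m] for i in range(len(pattern) - m + 1)]


def choose_partial_strings(
    patterns: Sequence[str], cost: Callable[[str], float]
) -> List[str]:
    """For each pattern, pick the candidate window with the lowest cost.

    Simplifying assumption: each window is scored independently by `cost`.
    A real cost model may depend on the whole set of partial strings.
    """
    m = shortest_length(patterns)
    return [min(candidate_partial_strings(p, m), key=cost) for p in patterns]


def toy_cost(window: str) -> float:
    """Hypothetical placeholder cost: prefer windows with more distinct characters."""
    return -len(set(window))


patterns = ["abcd", "aaxyz", "zz"]
baseline = leftmost_partial_strings(patterns)
chosen = choose_partial_strings(patterns, toy_cost)
```

With these three patterns, m is 2. The baseline keeps the first two characters of each pattern, while the selection step may pick a window further to the right when the placeholder cost prefers it. Substituting a derived search-time estimate for `toy_cost` would bring the sketch closer to the approach the abstract describes.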

Xia Liu | Ping Liu | Jianlong Tan | Yanbing Liu
