q-gram matching using tree models

q-gram matching is used for approximate substring matching in a wide range of application areas, including intrusion detection. In this paper, we present a tree-based model that performs fast, linear-time q-gram matching. All q-grams present in the text are stored in a tree structure similar to a trie. We use a tree redundancy pruning algorithm to reduce the size of the tree without losing any information. We also use suffix links to speed up q-gram search during query matching. We compare our work with the Rabin-Karp-based hash-table technique commonly used for multiple q-gram search, and we present experimental results on system call sequence data used for intrusion detection.
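To make the idea of a q-gram tree with suffix links concrete, the following is a minimal sketch and not the paper's implementation. It assumes a depth-limited trie built from every length-q window of the text (plus the shorter tail windows, so that every suffix link has a target), and it omits the redundancy pruning step, which the abstract does not specify. The names `Node`, `build_qgram_tree`, and `unseen_qgrams` are hypothetical, chosen for illustration only.

```python
from collections import deque


class Node:
    """One trie node; `depth` is the length of the string it spells."""
    __slots__ = ("children", "depth", "link")

    def __init__(self, depth):
        self.children = {}
        self.depth = depth
        self.link = None  # suffix link: node spelling this string minus its first symbol


def build_qgram_tree(text, q):
    """Build a depth-q trie over all windows text[i:i+q] and add suffix links."""
    root = Node(0)
    for i in range(len(text)):
        node = root
        for sym in text[i:i + q]:  # tail windows are shorter than q
            nxt = node.children.get(sym)
            if nxt is None:
                nxt = node.children[sym] = Node(node.depth + 1)
            node = nxt
    # Breadth-first pass: a parent's link is always set before its children's.
    root.link = root
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for sym, child in node.children.items():
            child.link = root if node is root else node.link.children[sym]
            queue.append(child)
    return root


def unseen_qgrams(root, query, q):
    """Return start positions of query q-grams that never occur in the text."""
    missing = []
    node = root
    for j, sym in enumerate(query):
        # Shorten the current match via suffix links until it can be extended.
        while node is not root and (node.depth == q or sym not in node.children):
            node = node.link
        node = node.children.get(sym, root)
        if j >= q - 1 and node.depth < q:
            missing.append(j - q + 1)
    return missing


# Hypothetical system call trace used as the "normal" text.
normal = "open read mmap mmap open read mmap close".split()
tree = build_qgram_tree(normal, 3)
observed = "open read mmap close open read".split()
print(unseen_qgrams(tree, observed, 3))
```

In this sketch each query symbol causes at most one descent, and every suffix-link step reduces the current depth, so the scan over the query is linear in its length (treating dictionary lookups as constant time). A hash-table baseline would instead look up each length-q window of the query separately.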
