Generalized Aho-Corasick Algorithm for Signature Based Anti-Virus Applications

Because of its accuracy, signature matching is considered an important technique in anti-virus/worm applications. Among some famous pattern matching algorithms, the Aho-Corasick (AC) algorithm can match multiple patterns simultaneously and guarantee deterministic performance under all circumstances and thus is widely adopted in various systems, especially when worst-case performance such as wire speed requirement is a design factor. However, the AC algorithm was developed only for strings while virus/worm signatures could be specified by simple regular expressions. In this paper, we generalize the AC algorithm to systematically construct a finite state pattern matching machine which can indicate the ending position in a finite input string for the first occurrence of virus/worm signatures that are specified by strings or simple regular expressions. The regular expressions studied in this paper may contain the following operators: * (match any number of symbols), ? (match any symbol), and {min, max} (match minimum of min, maximum of max symbols), which are defined in ClamAV, a popular open source anti-virus/worm software module, for signature specification.

[1]  Donald E. Knuth,et al.  Fast Pattern Matching in Strings , 1977, SIAM J. Comput..

[2]  Stuart E. Schechter,et al.  Fast Detection of Scanning Worm Infections , 2004, RAID.

[3]  Robert S. Boyer,et al.  A fast string searching algorithm , 1977, CACM.

[4]  Alfred V. Aho,et al.  Efficient string matching , 1975, Commun. ACM.