Data structures, algorithms and architectures for efficient regular expression evaluation

Regular expression matching is a crucial task in several networking applications, where packet payloads need to be inspected at line rate over large sets of complex patterns. Modern, payload-scanning firewalls and intrusion detection systems are perhaps the most notable applications that rely on this mechanism. When targeting memory-based architectures, in which the pattern data structure is stored in off-chip memory such as DRAM, the design of efficient regular expression matching engines is subject to two basic resource requirements: memory size and memory bandwidth. When designing Field Programmable Gate Array implementations, in which the pattern data structure is stored in on-chip memories or logic gates, the focus is on minimizing logic cell utilization and allowing high operational dock frequency. Keeping these requirements feasible is a challenge as pattern sets grow in size and expressiveness. In this work, we describe novel data structures and algorithms for high-speed regular expression evaluation and map the proposed schemes onto suitable parallel architectures. We aim to provide a comprehensive solution that is scalable with respect to both the size of the rule-set and the complexity of the individual rules. Our main contributions are the following. First, we propose an efficient and low complexity compression scheme for deterministic finite automata (DFAs), leading to a compression factor in excess of 98% across a set of representative data sets. Second, we identify and address problematic subpatterns, such as bounded and unbounded repetitions of wildcards and large character sets and back-references. Specifically, we propose a novel automaton with a limited memory footprint and a bounded memory bandwidth requirement that overcomes the limitations of pure DFA-based solutions. Third, we describe how to utilize our proposed techniques when targeting FPGA implementations. Finally, our tools for constructing and analyzing a range of alternative automata for regular-expression evaluation have been released as open-source, and are used actively by many researchers from around the world.