Exploiting Structure in Regular Expression Queries

Regular expressions (regexes) are widely used to extract critical information from large corpora of formatted text by finding patterns of interest. In tasks like log processing, the speed of regex matching is crucial. Data scientists and developers regularly use regex libraries that implement optimized regular expression matching based on modern automata theory. However, computing state transitions in the underlying regex evaluation engine can be inefficient when a regex query contains a multitude of string literals, and this inefficiency is further exacerbated when analyzing large data volumes. This paper presents BLARE (Blazingly Fast Regular Expression), a regular expression matching framework inspired by the mechanisms used in database engines, which use a declarative framework to explore multiple equivalent execution plans, all of which produce the correct final result. Similarly, BLARE decomposes a regex into multiple regex and string components and then creates evaluation strategies in which the components can be evaluated in an order that is not strictly a left-to-right translation of the input regex query. Rather than using a cost-based optimization approach, BLARE uses an adaptive runtime strategy based on a multi-armed bandit approach to find an efficient execution plan. BLARE is also modular and can be built on top of any existing regex library. We implemented BLARE on four commonly used regex libraries (RE2, PCRE2, Boost Regex, and ICU Regex) and evaluated it using two production workloads and one open-source workload. BLARE was 1.6× to 3.7× faster than RE2 and 3.4× to 7.9× faster than Boost Regex. PCRE2 did not finish on one of the workloads, but on the remaining two workloads, BLARE improved the performance of PCRE2 by 3.1× to over 100×. For the open-source dataset, BLARE provided a speedup of 61.7× for ICU Regex. The BLARE code is publicly available at https://github.com/mush-zhang/Blare.
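The idea of choosing among equivalent evaluation strategies at runtime can be illustrated with a minimal sketch. This is not BLARE's implementation: the two strategies, the hand-supplied literal list, the epsilon-greedy policy, and every name below (`match_direct`, `match_prefilter`, `EpsilonGreedy`, `count_matches`) are assumptions made for illustration, using Python's standard `re` module rather than any of the four libraries named above. The sketch assumes each supplied literal must appear in every matching line, so both strategies return the same answer and differ only in cost.

```python
import random
import re
import time


def match_direct(compiled, literals, line):
    # Strategy 0: hand the whole pattern to the regex engine.
    return compiled.search(line) is not None


def match_prefilter(compiled, literals, line):
    # Strategy 1: cheap substring checks first; the regex engine
    # only runs on lines that contain every required literal.
    if not all(lit in line for lit in literals):
        return False
    return compiled.search(line) is not None


class EpsilonGreedy:
    """Simple bandit policy (an illustrative choice, not taken from the paper)."""

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.counts = [0] * n_arms        # pulls per arm
        self.mean_cost = [0.0] * n_arms   # running mean of elapsed time per arm
        self.rng = random.Random(seed)

    def choose(self):
        # Try every arm once before exploiting.
        for arm, pulls in enumerate(self.counts):
            if pulls == 0:
                return arm
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))  # explore
        # Exploit: pick the arm with the lowest observed mean cost.
        return min(range(len(self.counts)), key=self.mean_cost.__getitem__)

    def update(self, arm, cost):
        self.counts[arm] += 1
        self.mean_cost[arm] += (cost - self.mean_cost[arm]) / self.counts[arm]


def count_matches(pattern, literals, lines):
    """Count matching lines while learning which strategy is cheaper."""
    compiled = re.compile(pattern)
    arms = [match_direct, match_prefilter]
    bandit = EpsilonGreedy(len(arms))
    hits = 0
    for line in lines:
        arm = bandit.choose()
        start = time.perf_counter()
        hits += arms[arm](compiled, literals, line)
        bandit.update(arm, time.perf_counter() - start)
    return hits, bandit.counts


if __name__ == "__main__":
    # Hypothetical log lines; the literals were picked by hand for this pattern.
    log = ["ERROR disk full on node7", "INFO ok", "ERROR timeout on node12"] * 100
    hits, pulls = count_matches(r"ERROR .* on node\d+", ["ERROR ", " on node"], log)
    print(hits, pulls)
```

Because both strategies are equivalent, the match count does not depend on which arm the bandit picks; only the per-arm pull counts and the running time vary from run to run. A real system would derive the literal components from the regex itself rather than take them as an argument.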
