High-Performance Regular Expression Matching with Parabix and LLVM

This thesis investigates the feasibility of constructing a high-performance, Unicode-capable regular expression search tool by combining parallel bit stream technologies and algorithms with the dynamic compilation capabilities of the LLVM compiler infrastructure. A prototype implementation of icGREP demonstrates that this approach is feasible, with asymptotic performance fully in line with that predicted by earlier prototyping work. The icGREP implementation extends the Parabix regular expression algorithms with new techniques for efficient Unicode character matching. In performance evaluations against other Unicode-capable regular expression search tools, icGREP shows asymptotic performance advantages that are often over 10X, although the overhead of dynamic compilation confines these benefits to relatively large input files.
