Data-Parallel String-Manipulating Programs

String-manipulating programs are an important class of programs with applications in malware detection, graphics, input sanitization for Web security, and large-scale HTML processing. This paper extends prior work on BEK, an expressive domain-specific language for writing string-manipulating programs, with algorithmic insights that make BEK both analyzable and data-parallel. By analyzable we mean that, unlike in most general-purpose programming languages, many algebraic properties of a BEK program are decidable (e.g., one can check whether two programs commute, or compute the inverse of a program). By data-parallel we mean that a BEK program can compute on arbitrary subsections of its input in parallel, thus exploiting parallel hardware. This latter requirement is particularly important for programs that operate on large data: without data parallelism, a programmer cannot hide the latency of reading data from storage media (e.g., reading a terabyte of data from a modern hard drive takes about 3 hours). With a data-parallel approach, the system can split data across multiple disks and thus hide this latency. A BEK program is expressive: a programmer can use conditionals, switch statements, and registers (local variables) to implement common string-manipulating programs. Unfortunately, this expressivity induces data dependencies, which are an obstacle to parallelism. The key contribution of this paper is an algorithm that automatically removes these data dependencies by mapping a BEK program into an intermediate format consisting of symbolic transducers, which extend classical transducers with symbolic predicates and symbolic assignments. We present a novel algorithm, which we call exploration, that performs symbolic loop unrolling of these transducers to obtain simplified versions of the original program. We show how these simplified versions can then be lifted to a stateless form, and from there compiled to data-parallel hardware.
To evaluate the efficacy of our approach, we demonstrate up to 8x speedups for a number of real-world BEK programs (e.g., HTML encoder and decoder) on data-parallel hardware. To the best of our knowledge, these are the first data-parallel implementations of these programs. To validate that our approach is correct, we use an automatic testing technique to compare our generated code to the original implementations and find no semantic deviations.
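
To make the data-dependency obstacle concrete, the following minimal Python sketch runs a tiny stateful string transducer on independent chunks of its input. It is not the exploration algorithm described above and does not use symbolic transducers; it only illustrates the general idea that a chunk can be processed without knowing its start state if the result is computed for every possible start state and the results are stitched together afterwards. The transducer (collapsing runs of spaces) and all names (`step`, `summarize_chunk`, `run_parallel`) are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical two-state transducer that collapses runs of spaces.
# State 0: previous character was not a space; state 1: it was a space.
STATES = (0, 1)


def step(state, ch):
    """One transition: return (next_state, emitted_output)."""
    if ch == " ":
        # Emit a space only if the previous character was not a space.
        return 1, ("" if state == 1 else " ")
    return 0, ch


def run_sequential(text, state=0):
    """Reference implementation: each step depends on the previous state."""
    out = []
    for ch in text:
        state, emitted = step(state, ch)
        out.append(emitted)
    return state, "".join(out)


def summarize_chunk(chunk):
    """Run the chunk from every possible start state.

    The result maps start_state -> (end_state, output), so the chunk
    no longer depends on what came before it.
    """
    return {s: run_sequential(chunk, s) for s in STATES}


def run_parallel(text, n_chunks=4):
    """Process chunks independently, then stitch the summaries together."""
    size = max(1, -(-len(text) // n_chunks))  # ceiling division
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(summarize_chunk, chunks))
    # Cheap sequential pass: pick, for each chunk, the entry that
    # matches the end state of the chunk before it.
    state, pieces = 0, []
    for summary in summaries:
        state, out = summary[state]
        pieces.append(out)
    return state, "".join(pieces)


if __name__ == "__main__":
    sample = "a  b   c "
    print(run_sequential(sample))
    print(run_parallel(sample, n_chunks=3))
```

The per-chunk work grows with the number of states, which is one reason a program with registers cannot be handled this way directly. Threads are used here only to show the structure; in CPython they would not give a real speedup for this CPU-bound loop.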
