Solving string constraints with Regex-dependent functions through transducers with priorities and variables

Regular expressions are a classical concept in formal language theory. Regular expressions in programming languages (RegEx), such as those in JavaScript, feature non-standard operator semantics (e.g., greedy/lazy Kleene star) as well as additional features such as capturing groups and references. While symbolic execution of programs containing RegExes calls for string solvers that natively support these features, such a solver has hitherto been missing. In this paper, we propose the first string theory and string solver that natively provide such support. The key idea of our string solver is to introduce a new automata model, called prioritized streaming string transducers (PSSTs), to formalize the semantics of RegEx-dependent string functions. PSSTs combine priorities, which were previously introduced in prioritized finite-state automata to capture greedy/lazy semantics, with string variables, as in streaming string transducers, to model capturing groups. We validate the consistency of the formal semantics with the actual JavaScript semantics by extensive experiments. Furthermore, to solve the string constraints, we show that PSSTs enjoy nice closure and algorithmic properties, in particular the regularity-preserving property (i.e., pre-images of regular constraints under PSSTs are regular), and we introduce a sound sequent calculus that exploits these properties and propagates regular constraints by taking post-images or pre-images. Although satisfiability of the string constraint language is undecidable in general, we show that our approach is complete for the so-called straight-line fragment. We evaluate the performance of our string solver on over 195,000 string constraints generated from an open-source RegEx library. The experimental results show the efficacy of our approach, which drastically improves on existing methods (via symbolic execution) in both precision and efficiency.
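The greedy/lazy distinction and the role of capturing groups can be seen in a small sketch. It uses Python's `re` module as a stand-in for JavaScript RegEx, on the assumption that the two engines agree on these particular patterns. The helper `split_groups` and the example patterns are hypothetical illustrations and are not taken from the paper or its benchmarks.

```python
import re


def split_groups(pattern: str, text: str) -> tuple:
    """Return the capturing-group contents when `pattern` matches all of `text`."""
    match = re.fullmatch(pattern, text)
    if match is None:
        raise ValueError(f"{pattern!r} does not match {text!r}")
    return match.groups()


# Both patterns accept the same language (a*), but the priority between
# "take another 'a'" and "stop" decides which group receives the input.
greedy_first = split_groups(r"(a*)(a*)", "aaa")   # greedy star in group 1
lazy_first = split_groups(r"(a*?)(a*)", "aaa")    # lazy star in group 1

# A RegEx-dependent string function: replace with references to groups.
# Python writes references as \1, \2; JavaScript's replace uses $1, $2.
swapped = re.sub(r"(\d+)-(\d+)", r"\2-\1", "10-20")

# Greedy vs. lazy matching changes the output of a replace-all as well.
lazy_tags = re.sub(r"<(.+?)>", r"[\1]", "<a><b>")
greedy_tags = re.sub(r"<(.+)>", r"[\1]", "<a><b>")

if __name__ == "__main__":
    print(greedy_first, lazy_first)
    print(swapped, lazy_tags, greedy_tags)
```

Because the output depends on which match the engine prefers, and not only on whether a match exists, a classical automaton that records acceptance alone cannot describe these functions. This is the gap that priorities and string variables are meant to fill.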
