Is stateful packrat parsing really linear in practice? a counter-example, an improved grammar, and its parsing algorithms

Stateful packrat parsing is an algorithm for parsing syntaxes that have context-sensitive features. Researchers widely accept that the running time of stateful packrat parsing is linear for real-world grammars, as demonstrated in existing studies. However, we have found cases in real-world grammars and tools where its running time becomes exponential. This paper proposes a new grammar formalism, parsing expression grammars with variable bindings, and two parsing algorithms for it: stateful packrat parsing with selected global states and stateful packrat parsing with conditional memoization. Our proposal eliminates the exponential behavior that appears in existing parsers and guarantees polynomial running time. The key idea behind our algorithms is to memoize only the information relevant to how the global states are used, rather than memoizing the full global states. We implemented our algorithms as a parser generator and evaluated them on real-world grammars. Our evaluation shows that our algorithms significantly outperform an existing stateful packrat parsing algorithm in both running time and space consumption. In particular, for malicious inputs that trigger exponential behavior in the existing algorithm, stateful packrat parsing with conditional memoization improves running time and space consumption by 260x and 217x, respectively.
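The key idea stated above, memoizing only the state information a rule actually depends on, can be illustrated with a small sketch. This is not the paper's algorithm: the `MemoTable` class, the `reads` mapping, the `indent` state variable, and the `match_digit` rule are all hypothetical names chosen for illustration. The sketch only shows why keying a memo table on the full global state causes redundant work when a rule does not depend on that state.

```python
from typing import Callable, Dict, Hashable, Iterable, Optional, Tuple

State = Dict[str, int]  # hypothetical global parser state: variable name -> value


class MemoTable:
    """Memo table keyed on (rule, position, projection of the global state)."""

    def __init__(self, reads: Optional[Dict[str, Iterable[str]]] = None) -> None:
        # reads[rule] lists the state variables the rule depends on (an assumption
        # supplied by the caller). A rule absent from `reads` is keyed on the
        # full state, mimicking memoization of the whole global state.
        self.reads = reads or {}
        self.table: Dict[Tuple[Hashable, ...], Optional[int]] = {}
        self.misses = 0

    def _key(self, rule: str, pos: int, state: State) -> Tuple[Hashable, ...]:
        names = self.reads.get(rule, state.keys())
        projection = tuple(sorted((name, state[name]) for name in names))
        return (rule, pos, projection)

    def apply(self, rule: str, pos: int, state: State,
              compute: Callable[[], Optional[int]]) -> Optional[int]:
        key = self._key(rule, pos, state)
        if key not in self.table:
            self.misses += 1  # the rule body runs only on a cache miss
            self.table[key] = compute()
        return self.table[key]


def match_digit(text: str, pos: int) -> Optional[int]:
    """State-independent rule: return the next position on success, else None."""
    return pos + 1 if pos < len(text) and text[pos].isdigit() else None


def count_misses(memo: MemoTable, text: str) -> int:
    # Apply the same rule at the same position under 100 different states.
    for indent in range(100):
        state = {"indent": indent}
        memo.apply("digit", 0, state, lambda: match_digit(text, 0))
    return memo.misses


# Keyed on the full state: every distinct state is a separate cache entry.
full_state_misses = count_misses(MemoTable(), "7")
# Keyed on the (empty) set of variables the rule reads: one entry is shared.
projected_misses = count_misses(MemoTable({"digit": []}), "7")
```

In this sketch the full-state table recomputes the rule once per distinct state, while the projected table computes it once and reuses the result, which is the kind of saving the abstract attributes to avoiding memoization of full global states.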
