A deterministic parsing algorithm for ambiguous regular expressions

We introduce a new parser generator, called the Berry–Sethi Parser (BSP), for ambiguous regular expressions (REs). The generator constructs a deterministic finite-state transducer that recognizes an input string, as the classical Berry–Sethi algorithm does, and additionally outputs a linear representation of all the syntax trees of the string; for infinitely ambiguous strings, a policy is chosen for selecting representative sets of trees. To construct the transducer, the RE symbols, including letters, parentheses and other metasymbols, are distinctly numbered, so that the corresponding language becomes locally testable. In this way a deterministic position automaton can be constructed, which recognizes the input and translates it into a compact DAG representation of the syntax trees. The correctness of the construction is proved. The transducer runs in time linear in the length of the input. Its descriptive complexity is analyzed as a function of established RE parameters: the alphabetic width, the number of null string symbols and the height of the RE tree. A condition for checking RE ambiguity on the transducer graph is stated. Experimental results of running the parser generator and the parser on a large RE collection are presented. The POSIX RE disambiguation criterion has also been applied to the parser.
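As a rough illustration of the numbering idea, the following Python sketch implements only the classical Berry–Sethi recognizer that the abstract takes as its starting point, not the BSP transducer itself. It numbers the letters of an RE, computes the first and follow sets of the numbered expression, and simulates the deterministic position automaton, whose states are sets of positions. The tuple encoding of the RE tree and the names `number`, `analyze` and `accepts` are assumptions made for this example; BSP additionally numbers parentheses and other metasymbols and emits output, which is not shown here.

```python
from itertools import count

END = 0  # position reserved for the end-of-text marker (an assumption of this sketch)

def number(node, counter, letters):
    """Copy the RE tree, replacing each letter leaf by a distinct numbered position."""
    kind = node[0]
    if kind == "sym":
        pos = next(counter)
        letters[pos] = node[1]
        return ("pos", pos)
    if kind == "eps":
        return node
    return (kind,) + tuple(number(child, counter, letters) for child in node[1:])

def analyze(node, follow):
    """Return (nullable, first, last) for the node and fill the follow sets in place."""
    kind = node[0]
    if kind == "eps":
        return True, set(), set()
    if kind == "pos":
        follow.setdefault(node[1], set())
        return False, {node[1]}, {node[1]}
    if kind == "alt":
        n1, f1, l1 = analyze(node[1], follow)
        n2, f2, l2 = analyze(node[2], follow)
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":
        n1, f1, l1 = analyze(node[1], follow)
        n2, f2, l2 = analyze(node[2], follow)
        for p in l1:  # every last position of the left part is followed by first(right)
            follow[p] |= f2
        return n1 and n2, f1 | (f2 if n1 else set()), l2 | (l1 if n2 else set())
    if kind == "star":
        n, f, l = analyze(node[1], follow)
        for p in l:  # iteration: last positions loop back to first positions
            follow[p] |= f
        return True, f, l
    raise ValueError(f"unknown node kind: {kind}")

def accepts(regex, text):
    """Run the deterministic position automaton of `regex` on `text`."""
    letters = {}
    tree = number(regex, count(1), letters)
    follow = {}
    _, first, _ = analyze(("cat", tree, ("pos", END)), follow)
    state = frozenset(first)  # a state is the set of positions that may come next
    for ch in text:
        state = frozenset(q for p in state if letters.get(p) == ch for q in follow[p])
        if not state:
            return False
    return END in state

# (a|b)*abb, written with the tuple encoding assumed above
a, b = ("sym", "a"), ("sym", "b")
regex = ("cat", ("cat", ("cat", ("star", ("alt", a, b)), a), b), b)
print(accepts(regex, "aabb"), accepts(regex, "ab"))
```

Each input character triggers one state transition, which matches the linear running time claimed for the recognition part. A table-driven generator would precompute the reachable position sets instead of building them on the fly as this sketch does.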
