Practical Earley parsing and the SPARK toolkit

Domain-specific, “little” languages are commonplace in computing. So too is the need to implement such languages; to meet this need, we have created SPARK (Scanning, Parsing, And Rewriting Kit), a toolkit for little language implementation in Python, an object-oriented scripting language. SPARK greatly simplifies the task of little language implementation. It requires little code to be written, and accommodates a wide range of users, including those without a background in compiler theory. Our toolkit is seeing increasing use on a variety of projects.

SPARK was designed to be easy to use with few limitations, and relies heavily on Earley's general parsing algorithm internally, which helps in meeting these design goals. Earley's algorithm, in its standard form, can be hard to use; indeed, experience with SPARK has highlighted several problems with the practical use of Earley's algorithm. Our research addresses and provides solutions for these problems, making some significant improvements to the implementation and use of Earley's algorithm.

First, Earley's algorithm suffers from a performance problem. Even under optimum conditions, a standard Earley parser is burdened with overhead. We extend directly-executable parsing techniques for use in Earley parsers; the resulting parsers run in time comparable to the much more specialized LALR(1) parsing algorithm.

Second is what we call the delayed action problem. General parsers like Earley's must, in the worst case, read the entire input before executing any semantic actions associated with the grammar rules. We attack this problem in two ways. We have identified conditions under which it is safe to execute semantic actions on the fly during recognition; as a side effect, this has yielded space savings of over 90% for some grammars. The other approach to the delayed action problem deals with the difficulty of handling context-dependent tokens. Such tokens are easy to handle using what we call “Schrödinger's tokens,” a superposition of token types.

Finally, Earley parsers are complicated by the need to process grammar rules with empty right-hand sides. We present a simple, efficient way to handle these empty rules, and prove that our new method is correct. We also show how our method may be used to create a new type of LR(0) automaton which is ideally suited for use in Earley parsers. Our work has made Earley parsing faster and more space-efficient, turning it into an excellent candidate for practical use in many applications.
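To make the empty-rule issue concrete, the following is a minimal sketch of an Earley recognizer in Python. It is not SPARK's code or API; the names `Item`, `nullable_symbols`, and `recognize` are hypothetical, and the grammar representation (a list of `(lhs, rhs)` pairs) is an assumption chosen for brevity. The sketch handles empty right-hand sides by precomputing the nullable nonterminals and letting the predictor step over a nullable symbol immediately, which is one simple way such rules can be accommodated.

```python
from collections import namedtuple

# An Earley item: index of a rule, dot position in its right-hand side,
# and the index of the set in which the item originated.
Item = namedtuple("Item", "rule dot origin")


def nullable_symbols(rules):
    """Return the nonterminals that can derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in nullable and all(s in nullable for s in rhs):
                nullable.add(lhs)
                changed = True
    return nullable


def recognize(rules, start, tokens):
    """Return True if `tokens` is a sentence of the grammar `rules`."""
    nonterminals = {lhs for lhs, _ in rules}
    nullable = nullable_symbols(rules)
    sets = [[] for _ in range(len(tokens) + 1)]

    def add(i, item):
        if item not in sets[i]:
            sets[i].append(item)

    for r, (lhs, _) in enumerate(rules):
        if lhs == start:
            add(0, Item(r, 0, 0))

    for i in range(len(tokens) + 1):
        for item in sets[i]:  # the list may grow while we iterate over it
            lhs, rhs = rules[item.rule]
            if item.dot == len(rhs):
                # Completer: advance every item that was waiting for `lhs`.
                for parent in sets[item.origin]:
                    p_rhs = rules[parent.rule][1]
                    if parent.dot < len(p_rhs) and p_rhs[parent.dot] == lhs:
                        add(i, Item(parent.rule, parent.dot + 1, parent.origin))
            elif rhs[item.dot] in nonterminals:
                # Predictor: add the rules for the expected nonterminal.
                sym = rhs[item.dot]
                for r, (l, _) in enumerate(rules):
                    if l == sym:
                        add(i, Item(r, 0, i))
                # Empty-rule handling: if `sym` can vanish, step over it now.
                if sym in nullable:
                    add(i, Item(item.rule, item.dot + 1, item.origin))
            elif i < len(tokens) and tokens[i] == rhs[item.dot]:
                # Scanner: the expected terminal matches the next token.
                add(i + 1, Item(item.rule, item.dot + 1, item.origin))

    return any(
        rules[it.rule][0] == start
        and it.dot == len(rules[it.rule][1])
        and it.origin == 0
        for it in sets[len(tokens)]
    )


# Example grammar with an empty rule: S -> A A A A, A -> a | E, E -> (empty)
GRAMMAR = [
    ("S", ("A", "A", "A", "A")),
    ("A", ("a",)),
    ("A", ("E",)),
    ("E", ()),
]
```

With this grammar, `recognize(GRAMMAR, "S", ["a"])` accepts, since the remaining three occurrences of `A` can each derive the empty string. The sketch favors clarity over speed: item sets are plain lists with linear membership tests, and no parse tree or semantic action is produced.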
