“Maximal-munch” tokenization in linear time

The lexical-analysis (or scanning) phase of a compiler attempts to partition an input string into a sequence of tokens. The convention in most languages is that the input is scanned left to right, and each token identified is a “maximal munch” of the remaining input, that is, the longest prefix of the remaining input that is a token of the language. Although most of the standard compiler textbooks present a way to perform maximal-munch tokenization, the algorithm they describe can, for certain sets of token definitions, cause the scanner to exhibit quadratic behavior in the worst case. In this article, we show that maximal-munch tokenization can always be performed in time linear in the size of the input.
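To make the quadratic case concrete, consider a hypothetical token set consisting of `a` and `a*b`. On an input of n copies of `a`, a backtracking scanner reads to the end of the input looking for a `b` before settling for the single-character token `a`, and repeats this for every token. The sketch below shows one way to avoid the repeated rescanning: it records (state, position) pairs that are known not to lead to an accepting state and stops a scan as soon as it reaches one. The DFA, the function name `tokenize`, and the table layout are assumptions made for illustration, and the code is a minimal sketch of the memoization idea, not necessarily the article's exact formulation.

```python
def tokenize(text, delta, start, accepting):
    """Maximal-munch tokenization with memoized failures.

    delta maps (state, char) -> state; a missing key means no transition.
    """
    failed = set()   # (state, pos) pairs from which no accepting state is reachable
    tokens = []
    i, n = 0, len(text)
    while i < n:
        state = start
        last_accept = -1   # end position of the longest token found so far
        pending = []       # pairs visited since the most recent accepting state
        j = i
        while j < n:
            if (state, j) in failed:
                break      # an earlier scan already showed this is a dead end
            nxt = delta.get((state, text[j]))
            if nxt is None:
                break
            pending.append((state, j))
            state = nxt
            j += 1
            if state in accepting:
                last_accept = j
                pending.clear()
        # Everything visited after the last accepting state led nowhere.
        failed.update(pending)
        if last_accept == -1:
            raise ValueError(f"no token matches at position {i}")
        tokens.append(text[i:last_accept])
        i = last_accept
    return tokens


# Hypothetical DFA for the token set { a, a*b }.
# State 1 accepts "a"; state 3 accepts "a*b"; state 2 means "two or more a's seen".
DELTA = {
    (0, "a"): 1, (0, "b"): 3,
    (1, "a"): 2, (1, "b"): 3,
    (2, "a"): 2, (2, "b"): 3,
}
START, ACCEPTING = 0, {1, 3}

print(tokenize("aaab", DELTA, START, ACCEPTING))  # ['aaab']
print(tokenize("aaa", DELTA, START, ACCEPTING))   # ['a', 'a', 'a']
```

Each step of the inner loop either consumes a character of the token that is eventually emitted or adds a new pair to `failed`, so under these assumptions the total work is bounded by the input length times the number of DFA states, which is linear in the input for a fixed token set.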
