From Regular Expressions to DFA's Using Compressed NFA's

We show how to turn a regular expression R of length r into an O(s) space representation of McNaughton and Yamada's NFA, where s is the number of occurrences of alphabet symbols in R, and s + 1 is the number of NFA states. The standard adjacency list representation of McNaughton and Yamada's NFA takes up $1 + 2s + s^2$ space in the worst case. The adjacency list representation of the NFA produced by Thompson takes up between 2r and 6r space, where r can be arbitrarily larger than s. Given any subset V of states in McNaughton and Yamada's NFA, our representation can be used to compute the set U of states one transition away from the states in V in optimal time $O(|V| + |U|)$. With the standard representation of McNaughton and Yamada's NFA, the same calculation requires $\Theta(|V| \times |U|)$ time in the worst case. Using Thompson's NFA, the equivalent calculation requires $\Theta(r)$ time in the worst case. An implementation of our NFA representation confirms that it takes up an order of magnitude less space than McNaughton and Yamada's machine. An implementation that produces a DFA from our NFA representation by subset construction shows linear and quadratic speedups over subset construction starting from both Thompson's and McNaughton and Yamada's NFA's. It also shows that the DFA produced from our NFA is as much as one order of magnitude smaller than DFA's constructed from the two other NFA's. We have implemented cgrep, a UNIX egrep-compatible program based on our NFA representation. A benchmark shows that cgrep is dramatically faster than both UNIX egrep and GNU e?grep. Throughout this thesis we stress the importance of syntax in the design of our algorithms. In particular, we exploit a method of program improvement in which costly repeated calculations are avoided by establishing and maintaining program invariants. This method of symbolic finite differencing has been used previously by Douglas Smith to derive efficient functional programs.
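For orientation, the following is a minimal sketch of the uncompressed baseline the abstract compares against: McNaughton and Yamada's NFA stored as plain adjacency sets, together with the naive computation of U from V. It is not the compressed representation developed in the thesis. The tuple encoding of the regular expression and the names `glushkov`, `step`, `symbol_at`, and `delta` are assumptions made for this illustration only.

```python
from itertools import count

def glushkov(ast):
    """Build McNaughton and Yamada's NFA for a regex given as an AST.

    Hypothetical AST encoding: ("sym", c), ("eps",), ("cat", l, r),
    ("alt", l, r), ("star", e).
    Returns (symbol_at, delta, finals). State 0 is the start state and
    states 1..s are the symbol occurrences, numbered left to right.
    An edge q -> p is implicitly labelled with symbol_at[p].
    """
    symbol_at = {}   # state -> alphabet symbol read on entering it
    follow = {}      # state -> set of states that may come next
    ids = count(1)

    def walk(node):
        # Returns (nullable, first, last) for the subexpression.
        kind = node[0]
        if kind == "eps":
            return True, set(), set()
        if kind == "sym":
            p = next(ids)
            symbol_at[p] = node[1]
            follow[p] = set()
            return False, {p}, {p}
        if kind == "star":
            _, first, last = walk(node[1])
            for p in last:
                follow[p] |= first
            return True, first, last
        n1, f1, l1 = walk(node[1])
        n2, f2, l2 = walk(node[2])
        if kind == "alt":
            return n1 or n2, f1 | f2, l1 | l2
        if kind == "cat":
            for p in l1:
                follow[p] |= f2
            first = f1 | f2 if n1 else f1
            last = l1 | l2 if n2 else l2
            return n1 and n2, first, last
        raise ValueError(f"unknown node kind: {kind}")

    nullable, first, last = walk(ast)
    delta = {0: set(first)}
    delta.update(follow)
    finals = set(last) | ({0} if nullable else set())
    return symbol_at, delta, finals

def step(delta, V):
    """Naive computation of U, the states one transition away from V.

    The cost is the total size of the adjacency sets visited, which can
    reach |V| * |U| because different states in V may share successors.
    """
    U = set()
    for q in V:
        U |= delta[q]
    return U

# (a|b)*abb has s = 5 symbol occurrences, so the NFA has s + 1 = 6 states.
A, B = ("sym", "a"), ("sym", "b")
ast = ("cat", ("cat", ("cat", ("star", ("alt", A, B)), A), B), B)
symbol_at, delta, finals = glushkov(ast)
print(len(delta), step(delta, {0}), step(delta, {3, 4}))
```

In this example states 0, 1, and 2 all have the same successor set, so `step` revisits the same targets once per source state. That redundancy is the kind of cost that the O(|V| + |U|) bound stated above avoids.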
