A Case Study on Grammatical-Based Representation for Regular Expression Evolution

Regular expressions, or simply regex, have been widely used as a powerful pattern matching and text extractor tool through decades. Although they provide a powerful and flexible notation to define and retrieve patterns from text, the syntax and the grammatical rules of these regex notations are not easy to use, and even to understand. Any regex can be represented as a Deterministic or Non-Deterministic Finite Automata; so it is possible to design a representation to automatically build a regex, and a optimization algorithm able to find the best regex in terms of complexity. This paper introduces both, a graph-based representation for regex, and a particular heuristic-based evolutionary computing algorithm based on grammatical features from this language in a particular data extraction problem.

[1]  E. Mark Gold,et al.  Complexity of Automaton Identification from Given Data , 1978, Inf. Control..

[2]  S C Kleene,et al.  Representation of Events in Nerve Nets and Finite Automata , 1951 .

[3]  A. E. Eiben,et al.  Introduction to Evolutionary Computing , 2003, Natural Computing Series.

[4]  Frederick E. Petry,et al.  Regular language induction with genetic programming , 1994, Proceedings of the First IEEE Conference on Evolutionary Computation. IEEE World Congress on Computational Intelligence.

[5]  Bell Telephone,et al.  Regular Expression Search Algorithm , 1968 .

[6]  G. Zipf,et al.  The Psycho-Biology of Language , 1936 .

[7]  Chia-Hsiang Chang,et al.  From Regular Expressions to DFA's Using Compressed NFA's , 1992, CPM.

[8]  Jeffrey E. F. Friedl Mastering Regular Expressions , 1997 .

[9]  Ken Thompson,et al.  Programming Techniques: Regular expression search algorithm , 1968, Commun. ACM.

[10]  María Dolores Rodríguez-Moreno,et al.  Automatic Web Data Extraction Based on Genetic Algorithms and Regular Expressions , 2009, Data Mining and Multi-agent Integration.