Playing regex golf with genetic programming

Regex golf has recently emerged as a specific kind of code golf, i.e., unstructured and informal programming competitions aimed at writing the shortest code solving a particular problem. A problem in regex golf consists of writing the shortest regular expression which matches all the strings in a given list and does not match any of the strings in another given list. The regular expression is expected to follow the syntax of a specified programming language, e.g., JavaScript or PHP. In this paper, we propose a regex golf player internally based on Genetic Programming. We generate a population of candidate regular expressions represented as trees and evolve this population based on a multi-objective fitness which minimizes both the errors and the length of the regular expression. We experimentally assess our player on a popular regex golf challenge consisting of 16 problems and compare our results against those of a recently proposed algorithm, the only one we are aware of. Our player obtains scores which improve over the baseline and are also highly competitive with those of human players. The time for generating a solution is usually on the order of tens of minutes, which is arguably comparable to the time required by human players.
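To make the two objectives concrete, the following is a minimal sketch of how a candidate could be scored on a regex golf problem. It is not the implementation evaluated in the paper: the function names `fitness` and `dominates`, the use of Python's `re` syntax, the penalty assigned to invalid expressions, and the example strings are all assumptions made for illustration.

```python
import re


def fitness(pattern, must_match, must_not_match):
    """Return (errors, length) for a candidate regex; lower is better on both.

    Hypothetical helper: errors counts strings that should match but do not,
    plus strings that should not match but do.
    """
    try:
        compiled = re.compile(pattern)
    except re.error:
        # Assumption: an invalid candidate is charged the maximum error count.
        return (len(must_match) + len(must_not_match), len(pattern))
    missed = sum(1 for s in must_match if compiled.search(s) is None)
    false_hits = sum(1 for s in must_not_match if compiled.search(s) is not None)
    return (missed + false_hits, len(pattern))


def dominates(a, b):
    """Pareto dominance: a is no worse on every objective and better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


# Illustrative problem instance (not taken from the challenge in the paper).
must_match = ["foo", "food", "foot"]
must_not_match = ["bar", "boo", "of"]

for candidate in ["foo", "fo", "o"]:
    print(candidate, fitness(candidate, must_match, must_not_match))
```

Under these assumptions, `fo` and `foo` both make no errors, but `fo` is shorter and therefore dominates `foo`; `o` is shorter still but makes two errors, so neither `fo` nor `o` dominates the other. A multi-objective selection scheme would keep both on the Pareto front.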
