Automatic string replace by examples

Search-and-replace is a text processing task that can be largely automated with regular expressions: the user describes, in a specific formal language, the regions to be modified (the search pattern) and the corresponding desired changes (the replacement expression). Writing and tuning these expressions requires considerable familiarity with the formalism and is typically a lengthy, error-prone process. In this paper we propose a tool based on Genetic Programming (GP) that automatically generates both the search pattern and the replacement expression from examples alone. The user merely provides examples of the input text along with the desired output text and needs no knowledge of the regular expression formalism or of GP. We are not aware of any similar proposal. We experimentally evaluated our proposal on four different search-and-replace tasks operating on real-world datasets and obtained good results, which suggests that the approach may indeed be practically viable.
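To make the setting concrete, the following minimal Python sketch shows what a search pattern, a replacement expression, and a set of input/output examples might look like, and how a candidate pair could be checked against the examples. The example texts, the candidate expressions, and the helper `count_satisfied` are all hypothetical illustrations; this is not the GP procedure or the fitness measure used in the paper.

```python
import re

# Hypothetical examples: (input text, desired output text).
examples = [
    ("Call 555-1234 now", "Call 5551234 now"),
    ("Fax: 555-9876", "Fax: 5559876"),
]

def count_satisfied(pattern, replacement, examples):
    """Count the examples for which applying the candidate
    search pattern and replacement yields the desired output."""
    return sum(
        1
        for text, wanted in examples
        if re.sub(pattern, replacement, text) == wanted
    )

# A hand-written candidate (search pattern, replacement expression).
# An example-driven tool would have to discover such a pair itself.
candidate_pattern = r"(\d{3})-(\d{4})"
candidate_replacement = r"\1\2"

score = count_satisfied(candidate_pattern, candidate_replacement, examples)
print(f"{score} of {len(examples)} examples satisfied")
```

A candidate that reproduces every desired output is consistent with the examples; a search procedure such as GP would need some way of ranking the candidates that do not.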
