Learning Text Patterns Using Separate-and-Conquer Genetic Programming

The problem of extracting knowledge from large volumes of unstructured textual information has become increasingly important. We consider the problem of extracting text slices that adhere to a syntactic pattern and propose an approach capable of generating the desired pattern automatically from a few annotated examples. Our approach is based on Genetic Programming and generates extraction patterns in the form of regular expressions that can be passed to existing engines without any post-processing. A key feature of our proposal is its ability to discover automatically whether the extraction task can be solved by a single pattern or whether a set of multiple patterns is required instead. We obtain this property by means of a separate-and-conquer strategy: once a candidate pattern provides adequate performance on a subset of the examples, the pattern is inserted into the set of final solutions and the evolutionary search continues on a smaller set of examples that includes only those not yet solved adequately. Our proposal outperforms an earlier state-of-the-art approach on three challenging datasets.
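The separate-and-conquer loop described above can be illustrated with a minimal sketch. It is not the method of the paper: the evolutionary search is replaced by a hypothetical stand-in, `best_pattern`, which simply picks from a fixed pool of candidate regular expressions, and "adequate performance" is assumed to mean "fully matches at least one remaining example", ignoring unwanted extractions. The function names, the candidate pool, and the example strings are all assumptions made for illustration.

```python
import re


def best_pattern(candidates, remaining):
    """Hypothetical stand-in for the evolutionary search: return the
    candidate regex that fully matches the most remaining examples."""
    def coverage(pattern):
        return sum(1 for s in remaining if re.fullmatch(pattern, s))
    return max(candidates, key=coverage)


def separate_and_conquer(candidates, examples):
    """Collect patterns one at a time, each time continuing the search
    only on the examples that no accepted pattern solves yet."""
    remaining = list(examples)
    solutions = []
    while remaining:
        pattern = best_pattern(candidates, remaining)
        solved = [s for s in remaining if re.fullmatch(pattern, s)]
        if not solved:
            # No candidate makes progress: stop and report what is left.
            break
        solutions.append(pattern)                                  # conquer
        remaining = [s for s in remaining if s not in solved]      # separate
    return solutions, remaining


# Assumed toy data: two date formats that no single candidate covers.
candidates = [r"\d{4}-\d{2}-\d{2}", r"\d{2}/\d{2}/\d{4}", r"[a-z]+"]
examples = ["2014-07-12", "2015-01-03", "12/07/2014", "03/01/2015", "1999-12-31"]

solutions, unsolved = separate_and_conquer(candidates, examples)

# The accepted patterns can be joined into one regular expression
# that an existing engine accepts as-is.
combined = "|".join(f"(?:{p})" for p in solutions)
```

In this toy run the loop accepts two patterns, one per date format, which mirrors the idea that the number of patterns is discovered during the search rather than fixed in advance.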
