Enabling information extraction by inference of regular expressions from sample entities

Regular expressions are the dominant technique to extract business relevant entities (e.g., invoice numbers or product names) from text data (e.g., invoices), since these entity types often follow a strict underlying syntactical pattern. However, the manual construction of regular expressions that guarantee a high recall and precision is a tedious manual task and requires expert knowledge. In this paper, we propose an approach that automatically infers regular expressions from a set of (positive) sample entities, which in turn can be derived either from enterprise databases (e.g., a product catalog) or annotated documents (e.g., historical invoices). The main innovation of our approach is that it learns effective regular expressions that can be easily interpreted and modified by a user. The effectiveness is obtained by a novel method that weights dependent entity features of different granularity (i.e. on character and token level) against each other and selects the most suitable ones to form a regular expression.

[1]  Kyuseok Shim,et al.  XTRACT: a system for extracting document type descriptors from XML documents , 2000, SIGMOD '00.

[2]  Ulf Leser,et al.  High-performance information extraction with AliBaba , 2009, EDBT '09.

[3]  Joseph M. Hellerstein,et al.  Potter's Wheel: An Interactive Data Cleaning System , 2001, VLDB.

[4]  Thorsten Brants,et al.  A Context Pattern Induction Method for Named Entity Extraction , 2006, CoNLL.

[5]  Dayne Freitag,et al.  Machine Learning for Information Extraction in Informal Domains , 2000, Machine Learning.

[6]  Peter Grünwald,et al.  A tutorial introduction to the minimum description length principle , 2004, ArXiv.

[7]  Alan W. Biermann,et al.  Two Dimensional Generalization in Information Extraction , 1999, AAAI/IAAI.

[8]  Sunita Sarawagi,et al.  Integrating Unstructured Data into Relational Databases , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[9]  Thomas Schwentick,et al.  Inference of concise regular expressions and DTDs , 2010, TODS.

[10]  Henning Fernau,et al.  Algorithms for learning regular expressions from positive data , 2009, Inf. Comput..

[11]  Rajesh Parekh,et al.  Automata Induction, Grammar Inference, and Language Acquisition , 2000 .

[12]  Gerhard Weikum,et al.  NAGA: harvesting, searching and ranking knowledge , 2008, SIGMOD Conference.

[13]  Fabio Ciravegna,et al.  Adaptive Information Extraction from Text by Rule Induction and Generalisation , 2001, IJCAI.

[14]  Automata Induction , 2009, Encyclopedia of Database Systems.

[15]  Maria T. Pazienza,et al.  Information Extraction , 2002, Lecture Notes in Computer Science.

[16]  Stephen Soderland,et al.  Learning Information Extraction Rules for Semi-Structured and Free Text , 1999, Machine Learning.

[17]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[18]  Hao Yu,et al.  Discovering patterns to extract protein-protein interactions from the literature: Part II , 2005, Bioinform..

[19]  Sriram Raghavan,et al.  Regular Expression Learning for Information Extraction , 2008, EMNLP.

[20]  Alicia Ageno,et al.  Adaptive information extraction , 2006, CSUR.