Clustering-based approach to learning regular expressions over a large alphabet for noisy unstructured text

Regular expressions have been used for information extraction tasks in a variety of domains. The alphabet of a regular expression can consist either of the relevant tokens corresponding to the entity of interest or of individual characters, in which case the alphabet becomes very large. Noise in unstructured text documents, combined with the larger alphabet, poses a significant challenge both for entity extraction and for algorithmically learning complex regular expressions. In this paper, we present a novel algorithm for regular expression learning that clusters similar matches to obtain the corresponding regular expressions, identifies and eliminates noisy clusters, and finally takes a weighted disjunction of the most promising candidate regular expressions to obtain the final expression. Experimental results show that this final expression achieves both high precision and high recall, which supports the applicability of our approach to entity extraction tasks of practical importance.
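
The three stages named in the abstract (cluster similar matches, discard noisy clusters, combine the surviving candidates into a weighted disjunction) can be illustrated with a small Python sketch. This is not the algorithm from the paper, whose details are not given here. The sketch assumes that two matches are "similar" when they share a coarse character-class signature, that a cluster is "noisy" when it has fewer than `min_support` members, and that a candidate's weight is its share of the retained examples. The names `signature`, `to_regex`, `learn_regex`, `combine`, and `min_support` are all hypothetical.

```python
import re
import string
from collections import defaultdict


def signature(text):
    """Collapse a string into runs of character classes (assumed similarity key).

    ASCII digits map to "9", ASCII letters to "A", anything else stays literal.
    """
    sig = []
    for ch in text:
        if ch in string.digits:
            cls = "9"
        elif ch in string.ascii_letters:
            cls = "A"
        else:
            cls = ch
        if not sig or sig[-1] != cls:
            sig.append(cls)
    return tuple(sig)


def to_regex(sig):
    """Turn a signature into a candidate regular expression."""
    parts = []
    for cls in sig:
        if cls == "9":
            parts.append(r"\d+")
        elif cls == "A":
            parts.append(r"[A-Za-z]+")
        else:
            parts.append(re.escape(cls) + "+")
    return "".join(parts)


def learn_regex(examples, min_support=2):
    """Cluster examples, drop small clusters, and weight the remaining candidates."""
    clusters = defaultdict(list)
    for ex in examples:
        clusters[signature(ex)].append(ex)

    # Assumption: a cluster with too few members is treated as noise.
    kept = {s: m for s, m in clusters.items() if len(m) >= min_support}
    total = sum(len(m) for m in kept.values())
    if total == 0:
        return []

    # Assumption: weight = fraction of retained examples covered by the cluster.
    weighted = [(to_regex(s), len(m) / total) for s, m in kept.items()]
    weighted.sort(key=lambda pair: -pair[1])
    return weighted


def combine(weighted):
    """Build one disjunction, ordered by descending weight."""
    return "|".join(f"(?:{pattern})" for pattern, _ in weighted)


examples = [
    "555-1234", "555-9876", "123-4567",
    "(555) 123-4567", "(800) 555-0000",
    "call me",  # a lone, dissimilar match that ends up in its own cluster
]
weighted = learn_regex(examples)
pattern = combine(weighted)
```

In this sketch the weights only determine the order of alternatives in the disjunction. A fuller treatment could also use them to prune low-weight candidates or to score individual matches.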
