Automatic rule refinement for information extraction

Rule-based information extraction from text is increasingly used to populate databases and to support structured queries over unstructured text. Specifying suitable extraction rules requires considerable skill, and standard practice is to refine rules iteratively, at substantial effort. In this paper, we show that techniques developed for data provenance, which determine the lineage of a tuple in a database, can be leveraged to assist in rule refinement. Specifically, given a set of extraction rules together with examples of correctly and incorrectly extracted data, we have developed a technique that suggests a ranked list of rule modifications for an expert rule specifier to consider. We implemented our technique in the SystemT information extraction system developed at IBM Research - Almaden and experimentally demonstrate its effectiveness.
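
As a rough illustration of the idea only, and not the algorithm implemented in SystemT, the sketch below assumes a toy setting: each extracted tuple carries a lineage set naming the rules that all contributed to it, and each tuple is labeled correct or incorrect. Tightening a rule is assumed to drop every tuple whose lineage contains it, so a candidate modification can be scored by the incorrect tuples it removes minus the correct tuples it loses. The rule ids, tuple ids, and the function `rank_rule_changes` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: tuple id -> set of rule ids in its lineage.
# Assumption: a tuple disappears if any rule in its lineage is tightened.
provenance = {
    "t1": {"R1", "R2"},
    "t2": {"R1"},
    "t3": {"R2", "R3"},
    "t4": {"R1", "R3"},
}

# Labels from the expert: True = correctly extracted, False = incorrect.
labels = {"t1": True, "t2": False, "t3": True, "t4": False}


def rank_rule_changes(provenance, labels):
    """Rank rules by (incorrect tuples removed) - (correct tuples lost)."""
    removed_incorrect = defaultdict(int)
    lost_correct = defaultdict(int)
    for tuple_id, rules in provenance.items():
        for rule in rules:
            if labels[tuple_id]:
                lost_correct[rule] += 1
            else:
                removed_incorrect[rule] += 1
    all_rules = set(removed_incorrect) | set(lost_correct)
    scored = [(rule, removed_incorrect[rule] - lost_correct[rule])
              for rule in all_rules]
    # Highest score first; ties broken by rule id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


ranking = rank_rule_changes(provenance, labels)
for rule, score in ranking:
    print(rule, score)
```

In this toy data, R1 ranks first because tightening it would remove two incorrect tuples at the cost of one correct tuple. A real system would need finer-grained modifications than dropping all output of a rule, and a richer notion of lineage than a flat set of rule ids.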
