Provenance-based dictionary refinement in information extraction

Dictionaries of terms and phrases (e.g. common person or organization names) are integral to information extraction systems that extract structured information from unstructured text. Noisy or unrefined dictionaries can produce many incorrect results even when the extraction rules are highly precise and sophisticated. In general, the results of the system depend on dictionary entries in arbitrarily complex ways, and removing a set of entries can remove both correct and incorrect results. Further, any such refinement requires laborious manual labeling of the results. In this paper, we study the dictionary refinement problem and address the above challenges. Using the provenance of the outputs in terms of the dictionary entries, we formalize an optimization problem of maximizing the quality of the system with respect to the refined dictionaries, study the complexity of this problem, and give efficient algorithms. We also propose solutions for incomplete labeling of the results, in which we estimate the missing labels under an assumed statistical model. We conclude with a detailed experimental evaluation using several real-world extractors and competition datasets to validate our solutions. Beyond information extraction, our provenance-based techniques and solutions may find applications in view maintenance in general relational settings.
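
To make the optimization problem concrete, the following is a minimal illustrative sketch in Python, not the algorithms of the paper. It assumes a simplified setting that the abstract does not specify: each labeled result carries a provenance given as a set of dictionary entries, a result survives only if none of those entries is removed, quality is measured by F-score, and recall is taken relative to the correct results of the unrefined system. The names `f_score` and `greedy_refine`, the greedy strategy, and the toy data are all hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

# A labeled result: (dictionary entries in its provenance, is the result correct?).
# Assumption: a result disappears as soon as any entry in its provenance is removed.
Result = Tuple[FrozenSet[str], bool]


def f_score(results: List[Result], removed: Set[str]) -> float:
    """F-score of the results that survive after deleting `removed` entries.

    Recall is measured against the correct results of the unrefined system
    (an assumption made for this sketch).
    """
    total_correct = sum(1 for _, ok in results if ok)
    surviving = [ok for prov, ok in results if not (prov & removed)]
    true_pos = sum(surviving)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(surviving)
    recall = true_pos / total_correct
    return 2 * precision * recall / (precision + recall)


def greedy_refine(results: List[Result], k: int) -> Set[str]:
    """Greedily remove up to k entries, each time picking the entry whose
    removal improves the F-score the most. This is a heuristic stand-in,
    with no optimality guarantee."""
    entries = set().union(*(prov for prov, _ in results))
    removed: Set[str] = set()
    for _ in range(k):
        best_entry, best = None, f_score(results, removed)
        for entry in sorted(entries - removed):
            score = f_score(results, removed | {entry})
            if score > best:
                best_entry, best = entry, score
        if best_entry is None:  # no single removal helps any further
            break
        removed.add(best_entry)
    return removed


# Toy data: "april" is an ambiguous first name that mostly yields wrong matches.
results: List[Result] = [
    (frozenset({"april"}), False),
    (frozenset({"april"}), False),
    (frozenset({"april"}), False),
    (frozenset({"smith"}), True),
    (frozenset({"smith"}), True),
    (frozenset({"smith"}), True),
    (frozenset({"april", "smith"}), True),
]
to_remove = greedy_refine(results, k=2)
```

On this toy data, removing "april" deletes three incorrect results but also the correct result that depends on both entries, which is the trade-off between precision and recall that the refinement problem has to balance.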
