Towards Efficient Named-Entity Rule Induction for Customizability

Generic rule-based systems for Information Extraction (IE) have been shown to work reasonably well out of the box and to achieve state-of-the-art accuracy with further domain customization. However, manually building and customizing rules is widely recognized as a complex and labor-intensive process. In this paper, we discuss an approach that facilitates building customizable rules for Named-Entity Recognition (NER) tasks via rule induction in the Annotation Query Language (AQL). Given a set of basic features and an annotated document collection, our goal is to generate an initial set of rules that has reasonable accuracy and is interpretable, so that a human developer can easily refine it. We present an efficient rule induction process modeled on a four-stage manual rule development process, and report promising initial results with our system. We also propose a simple notion of extractor complexity as a first step toward quantifying the interpretability of an extractor, and study the effect of induction bias and of customizing the basic features on the accuracy and complexity of the induced rules. Our experiments demonstrate that the induced rules have good accuracy and low complexity according to our complexity measure.
