A Practical Incremental Learning Framework For Sparse Entity Extraction

This work addresses challenges arising from extracting entities from textual data, including the high cost of data annotation, model accuracy, selecting appropriate evaluation criteria, and the overall quality of annotation. We present a framework that integrates Entity Set Expansion (ESE) and Active Learning (AL) to reduce the annotation cost of sparse data and provide an online evaluation method as feedback. This incremental and interactive learning framework allows for rapid annotation and subsequent extraction of sparse data while maintaining high accuracy. We evaluate our framework on three publicly available datasets and show that it drastically reduces the cost of sparse entity annotation by an average of 85% and 45% to reach 0.9 and 1.0 F-Scores respectively. Moreover, the method exhibited robust performance across all datasets.

[1]  Frederick Reiss,et al.  Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems! , 2013, EMNLP.

[2]  Dan Wu,et al.  Conditional Random Fields with High-Order Features for Sequence Labeling , 2009, NIPS.

[3]  Beatrice Santorini,et al.  Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision) , 1990 .

[4]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[5]  Yifan He,et al.  An Information Extraction Customizer , 2014, TSD.

[6]  Ralph Grishman,et al.  Fine-grained Entity Set Refinement with User Feedback , 2011 .

[7]  Christopher D. Manning,et al.  Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling , 2005, ACL.

[8]  Jiawei Han,et al.  SetExpan: Corpus-Based Set Expansion via Context Feature Selection and Rank Ensemble , 2017, ECML/PKDD.

[9]  Mark Craven,et al.  An Analysis of Active Learning Strategies for Sequence Labeling Tasks , 2008, EMNLP.

[10]  Josef Ruppenhofer,et al.  Assessing the benefits of partial automatic pre-labeling for frame-semantic annotation , 2009, Linguistic Annotation Workshop.

[11]  Kugatsu Sadamitsu,et al.  Entity Set Expansion using Interactive Topic Information , 2012, PACLIC.

[12]  Jianhua Lin,et al.  Divergence measures based on the Shannon entropy , 1991, IEEE Trans. Inf. Theory.

[13]  Christopher D. Manning,et al.  Improved Pattern Learning for Bootstrapped Entity Extraction , 2014, CoNLL.

[14]  Sophia Ananiadou,et al.  Accelerating the annotation of sparse named entities by dynamic sentence selection , 2008, BMC Bioinformatics.

[15]  Anthony N. Nguyen,et al.  Active learning reduces annotation time for clinical concept extraction , 2017, Int. J. Medical Informatics.

[16]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[17]  Richard Tzong-Han Tsai,et al.  Overview of BioCreative II gene mention recognition , 2008, Genome Biology.

[18]  Erik F. Tjong Kim Sang,et al.  Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition , 2003, CoNLL.

[19]  Mihai Surdeanu,et al.  The Stanford CoreNLP Natural Language Processing Toolkit , 2014, ACL.

[20]  Valentin Jijkoun,et al.  "More like these": growing entity classes from seeds , 2007, CIKM '07.

[21]  Danqi Chen,et al.  A Fast and Accurate Dependency Parser using Neural Networks , 2014, EMNLP.

[22]  Yang Li,et al.  Leveraging Pattern Semantics for Extracting Entities in Enterprises , 2015, WWW.

[23]  Louise Deléger,et al.  Evaluating the impact of pre-annotation on annotation speed and potential bias: natural language processing gold standard development for clinical named entity recognition in clinical trial announcements , 2013, J. Am. Medical Informatics Assoc..

[24]  Gary Geunbae Lee,et al.  MMR-based Active Machine Learning for Bio Named Entity Recognition , 2006, NAACL.

[25]  Beatrice Santorini Part-of-speech tagging guidelines for the penn treebank project , 1990 .

[26]  Udo Hahn,et al.  Semi-Supervised Active Learning for Sequence Labeling , 2009, ACL.

[27]  Zhe Chen,et al.  EgoSet: Exploiting Word Ego-networks and User-generated Ontology for Multifaceted Set Expansion , 2016, WSDM.