How to Invest my Time: Lessons from Human-in-the-Loop Entity Extraction

Recognizing entities that follow or closely resemble a regular expression (regex) pattern is an important task in information extraction. Common approaches to extracting such entities require humans either to write a regex that recognizes the entity or to manually label entity mentions in a document corpus. While human effort is critical to building an entity recognition model, surprisingly little is known about how best to invest that effort under a limited time budget. To answer this question, we consider an iterative human-in-the-loop (HIL) framework in which users write a regex or manually label entity mentions, and a classifier is then trained and refined on the information they provide. We demonstrate on five entity recognition tasks that classification accuracy improves over time with either approach. When a user may choose between regex construction and manual labeling, we find that (1) if the time budget is low, spending all of it on regex construction is often advantageous, (2) if the time budget is high, spending all of it on manual labeling seems to be superior, and (3) between those two extremes, writing regexes and then switching to manual labeling is typically the best approach. Our code and data are available at https://github.com/nymph332088/HILRecognizer.
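The loop described above can be illustrated with a minimal sketch. It is not the implementation from the linked repository: the character-shape features, the count-based classifier, the example tokens, and every function name (`shape`, `weak_labels`, `train`, `predict`) are assumptions chosen only to show how a user-written regex can seed training labels and how a manual label can then refine them.

```python
import re
from collections import Counter


def shape(token):
    """Map a token to a coarse character shape, e.g. 'Ab-12' -> 'Aa-99'."""
    out = []
    for ch in token:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)
    return "".join(out)


def weak_labels(tokens, pattern):
    """Label each token 1 if the user's regex matches it entirely, else 0."""
    rx = re.compile(pattern)
    return {i: int(rx.fullmatch(t) is not None) for i, t in enumerate(tokens)}


def train(tokens, labels):
    """Toy classifier (an assumption): per-shape counts of each label."""
    model = {}
    for i, y in labels.items():
        model.setdefault(shape(tokens[i]), Counter())[y] += 1
    return model


def predict(model, token):
    """Predict 1 if the token's shape was labeled positive more often than negative."""
    counts = model.get(shape(token))
    if counts is None:
        return 0  # unseen shape: default to "not an entity"
    return int(counts[1] > counts[0])


# Hypothetical corpus of tokens; the entity of interest is a phone-like number.
tokens = ["555-1234", "call", "555-9876", "867-5309", "x12-3456", "now"]

# Step 1: the user writes a (deliberately narrow) regex, which yields noisy labels.
labels = weak_labels(tokens, r"555-\d{4}")
model = train(tokens, labels)

# Step 2: the user manually corrects one mention the regex missed, then retrains.
labels[3] = 1
model = train(tokens, labels)
```

In this toy setting the classifier trained on regex-derived labels already accepts a number such as "212-0000" that the narrow regex itself rejects, and the manual correction removes the one conflicting label for that shape. How much time to spend on each step is the question the paper studies; the sketch makes no claim about it.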

[1]  Shanshan Zhang,et al.  Regular Expression Guided Entity Mention Mining from Noisy Web Data , 2018, EMNLP.

[2]  Robert Rieger,et al.  Enabling information extraction by inference of regular expressions from sample entities , 2011, CIKM '11.

[3]  Christopher Ré,et al.  SwellShark: A Generative Model for Biomedical Named Entity Recognition without Labeled Data , 2017, ArXiv.

[4]  Henning Fernau,et al.  Algorithms for learning regular expressions from positive data , 2009, Inf. Comput..

[5]  Hua Xu,et al.  A study of active learning methods for named entity recognition in clinical text , 2015, J. Biomed. Informatics.

[6]  François Denis,et al.  Learning Regular Languages from Simple Positive Examples , 2001, Machine Learning.

[7]  Eduard H. Hovy,et al.  End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF , 2016, ACL.

[8]  Qing Zeng-Treitler,et al.  Regular expression-based learning to extract bodyweight values from clinical notes , 2015, J. Biomed. Informatics.

[9]  Christopher De Sa,et al.  Data Programming: Creating Large Training Sets, Quickly , 2016, NIPS.

[10]  Teng Ren,et al.  Learning Named Entity Tagger using Domain-Specific Dictionary , 2018, EMNLP.

[11]  Scott T. Weiss,et al.  Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system , 2006, BMC Medical Informatics Decis. Mak..

[12]  Claudiu Musat,et al.  Unsupervised Aspect Term Extraction with B-LSTM & CRF using Automatically Labelled Datasets , 2017, WASSA@EMNLP.

[13]  Guillaume Lample,et al.  Neural Architectures for Named Entity Recognition , 2016, NAACL.

[14]  Sameep Mehta,et al.  Content and Context: Two-Pronged Bootstrapped Learning for Regex-Formatted Entity Extraction , 2018, AAAI.

[15]  Arnold W. M. Smeulders,et al.  Active learning using pre-clustering , 2004, ICML.

[16]  Burr Settles,et al.  Active Learning Literature Survey , 2009 .

[17]  E. Medvet,et al.  Inference of Regular Expressions for Text Extraction from Examples , 2016, IEEE Transactions on Knowledge and Data Engineering.

[18]  Yoshua Bengio,et al.  Random Search for Hyper-Parameter Optimization , 2012, J. Mach. Learn. Res..

[19]  Eric Medvet,et al.  Active Learning of Regular Expressions for Entity Extraction , 2018, IEEE Transactions on Cybernetics.

[20]  Benjamin Livshits,et al.  Program Boosting , 2015, POPL.

[21]  Eric Medvet,et al.  Automatic generation of regular expressions from examples with genetic programming , 2012, GECCO '12.

[22]  Sriram Raghavan,et al.  Regular Expression Learning for Information Extraction , 2008, EMNLP.

[23]  Anima Anandkumar,et al.  Deep Active Learning for Named Entity Recognition , 2017, Rep4NLP@ACL.

[24]  Eric Nichols,et al.  Named Entity Recognition with Bidirectional LSTM-CNNs , 2015, TACL.

[25]  Karin Murthy,et al.  Improving Recall of Regular Expressions for Information Extraction , 2012, WISE.

[26]  Eric Medvet,et al.  Automatic Synthesis of Regular Expressions from Examples , 2014, Computer.