Regular Expression Guided Entity Mention Mining from Noisy Web Data

Many important entity types in web documents, such as dates, times, email addresses, and course numbers, follow or closely resemble patterns that can be described by Regular Expressions (REs). Because web documents are highly diverse and are generated in many different ways, even seemingly straightforward tasks, such as identifying mentions of dates in a document, become very challenging. It is reasonable to claim that it is impossible to create an RE capable of identifying such entities in web documents with perfect precision and recall. Rather than abandoning REs as a go-to approach for entity detection, this paper explores ways to combine the expressive power of REs, the ability of deep learning to learn from large data, and a human-in-the-loop approach into a new integrated framework for entity identification from web data. The framework starts by creating new REs, or collecting existing ones, for a particular entity type. Those REs are then applied to a large document corpus to collect weak labels for the entity mentions, and a neural network is trained to predict those RE-generated weak labels. Finally, a human expert is asked to label a small set of documents, and the neural network is fine-tuned on those documents. The experimental evaluation on several entity identification problems shows that the proposed framework achieves impressive accuracy while requiring very modest human effort.
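The weak-labeling stage described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two date patterns, the whitespace tokenizer, the BIO tag names, and the function `weak_label` are all assumptions made for illustration. It only shows how matches from a handful of imperfect REs might be turned into token-level labels that a neural network could later be trained to predict.

```python
import re

# Hypothetical, deliberately imperfect date patterns (assumptions for
# illustration; these are not the REs used in the paper).
DATE_PATTERNS = [
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    re.compile(
        r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
        r"[a-z]*\.? \d{1,2}, \d{4}\b"
    ),
]


def weak_label(text, patterns=DATE_PATTERNS):
    """Return (token, tag) pairs, tagging tokens that fall inside an RE match.

    Tokens are whitespace-delimited. A token that starts exactly where a
    match starts is tagged B-DATE, a token that starts later inside the
    same match is tagged I-DATE, and every other token is tagged O.
    """
    spans = [m.span() for p in patterns for m in p.finditer(text)]
    labeled = []
    for tok in re.finditer(r"\S+", text):
        tag = "O"
        for start, end in spans:
            if start <= tok.start() < end:
                tag = "B-DATE" if tok.start() == start else "I-DATE"
                break
        labeled.append((tok.group(), tag))
    return labeled


if __name__ == "__main__":
    for sentence in ("Exam on March 3, 2019 at noon.", "Due 5 Dec 2018"):
        print(weak_label(sentence))
```

The second sentence contains a date in a format neither pattern covers, so every token is tagged `O`. This is the kind of recall gap in RE-generated weak labels that the framework aims to close by training a neural network on the weak labels and then fine-tuning it on a small set of expert-labeled documents.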
