A combination of active learning and self-learning for named entity recognition on Twitter using conditional random fields

Abstract In recent years, many natural language processing (NLP) applications have been built with machine learning, which makes data annotation a central task. A common way to improve system performance is to train on a large, high-quality dataset annotated by experts. Because such annotation is expensive, active learning (AL) and self-learning can be used to reduce the cost: self-learning uses a trained classifier to find instances it can label with high reliability, while AL uses a query algorithm to pick the most informative instances for human annotation. This paper proposes a method that combines AL and self-learning to reduce the labeling effort for named entity recognition on tweet streams by using both machine-labeled and manually labeled data. The AL queries select the most informative instances based on the diversity of their context and content. Conditional random fields (CRFs) serve as the underlying model for training the classifier that selects highly reliable instances. Experiments on Twitter data show that the proposed method reduces the human labeling effort and can significantly improve system performance.
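
The division of labor between the two strategies can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the function name `split_pool`, the `predict` callback, the threshold, and the query budget are all hypothetical, and the query step uses plain least-confidence ranking as a stand-in for the diversity-based query over context and content that the abstract describes. In practice `predict` would wrap a trained CRF tagger that returns a label sequence and a sequence-level confidence.

```python
from typing import Callable, List, Sequence, Tuple

# (predicted labels, confidence in [0, 1]); the confidence source is an assumption,
# e.g. the probability a CRF assigns to its best label sequence.
Prediction = Tuple[List[str], float]


def split_pool(
    pool: Sequence[List[str]],
    predict: Callable[[List[str]], Prediction],
    self_threshold: float = 0.9,   # hypothetical cut-off for "highly reliable"
    query_budget: int = 2,         # hypothetical number of human queries per round
):
    """Partition unlabeled sentences into machine-labeled and to-be-queried sets."""
    scored = [(tokens, *predict(tokens)) for tokens in pool]

    # Self-learning side: keep predictions the model is highly confident about.
    machine_labeled = [
        (tokens, labels) for tokens, labels, conf in scored if conf >= self_threshold
    ]

    # Active-learning side (least-confidence stand-in, not the paper's
    # diversity-based query): send the least certain sentences to a human.
    uncertain = [(tokens, conf) for tokens, _, conf in scored if conf < self_threshold]
    uncertain.sort(key=lambda pair: pair[1])
    to_query = [tokens for tokens, _ in uncertain[:query_budget]]

    return machine_labeled, to_query
```

Each training round would then add both the machine-labeled pairs and the human-labeled answers to the training set before retraining the classifier.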
