Corpus Expansion for Neural CWS on Microblog-Oriented Data with λ-Active Learning Approach

Microblog data contains rich information about real-world events and has great commercial value, so microblog-oriented natural language processing (NLP) tasks have attracted considerable attention from researchers. However, the performance of microblog-oriented Chinese Word Segmentation (CWS) based on deep neural networks (DNNs) is still unsatisfactory. One critical reason is that the existing microblog-oriented training corpus is too small to train effective weight matrices for DNNs. In this paper, we propose a novel active learning method to expand the training corpus for DNNs. Because microblogs contain a large number of partially overlapping sentences, it is difficult to select samples with high annotation value from raw microblogs during the active learning procedure. To address this, a parameter λ is introduced to control the number of repeatedly selected samples, and various strategies are adopted to measure the overall annotation value of a sample. Experiments on the benchmark datasets of NLPCC 2015 show that our λ-active learning method outperforms both the baseline system and the state-of-the-art method. The results also demonstrate that the performance of DNNs trained on the extended corpus is significantly improved.

Key words: Chinese word segmentation, active learning, deep neural networks, corpus expansion
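
The abstract does not specify how λ enters the selection step, so the following is only a minimal sketch of one possible reading, not the authors' algorithm. It assumes that a "repeated" sample is one whose character bigrams overlap heavily with an already selected sentence, and that λ caps how many such near-duplicates may enter a batch. The names `overlap`, `select_batch`, `overlap_threshold`, and the externally supplied `score` function (standing in for whatever annotation-value measure is used) are all hypothetical.

```python
from typing import Callable, List, Sequence, Set


def char_bigrams(sentence: str) -> Set[str]:
    """Return the set of character bigrams in a sentence."""
    return {sentence[i:i + 2] for i in range(len(sentence) - 1)}


def overlap(a: str, b: str) -> float:
    """Jaccard overlap of character bigrams (an assumed similarity measure)."""
    ga, gb = char_bigrams(a), char_bigrams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def select_batch(
    pool: Sequence[str],
    score: Callable[[str], float],
    batch_size: int,
    lam: int,
    overlap_threshold: float = 0.5,
) -> List[str]:
    """Pick up to batch_size high-scoring sentences for annotation.

    Assumption: at most `lam` of the picked sentences may be near-duplicates
    (overlap >= overlap_threshold) of a sentence picked earlier in the batch.
    """
    ranked = sorted(pool, key=score, reverse=True)  # highest value first
    selected: List[str] = []
    repeats = 0
    for sent in ranked:
        if len(selected) >= batch_size:
            break
        is_repeat = any(overlap(sent, s) >= overlap_threshold for s in selected)
        if is_repeat:
            if repeats >= lam:
                continue  # repeat budget spent: skip this near-duplicate
            repeats += 1
        selected.append(sent)
    return selected
```

Under this reading, `lam=0` forbids near-duplicates entirely, while larger values let a few overlapping but high-value sentences through. How the score itself is computed (for example, from model uncertainty) is left open here because the abstract only says that various strategies are used.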
