Entity Subword Encoding for Chinese Long Entity Recognition

Named entity recognition (NER) is a fundamental task in natural language processing that jointly predicts entity boundaries and pre-defined entity categories. For Chinese NER, the recognition of long entities has not yet been well addressed: as the character sequences of entities grow longer, existing character-based and word-based neural methods find them harder to recognize. In this paper, we investigate Chinese NER methods that operate on subword units and propose to recognize Chinese long entities based on subword encoding. First, our method generates subword units from known entities, which avoids the noise introduced by Chinese word segmentation and makes long entity boundaries easier to determine. Then, mixed subword-character sequences of the sentences are fed as input to character-based neural methods to perform Chinese NER. We apply our method to iterated dilated convolutional neural networks (ID-CNNs) with conditional random fields (CRF) for entity recognition. Experimental results on the benchmark People’s Daily and Weibo datasets show that our subword-based method achieves strong performance on long entity recognition.
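The two steps described above (generating subword units from known entities, then building a mixed subword-character sequence) can be illustrated with a minimal sketch. It assumes a byte-pair-encoding-style merge procedure, which the abstract does not specify; the function names `learn_merges`, `merge_pair`, and `encode`, the `min_freq` threshold, and the toy entity list are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one joined unit."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_merges(entities, num_merges, min_freq=2):
    """Learn BPE-style merge rules from known entity strings only (assumed procedure)."""
    # Every entity starts as a sequence of single characters.
    corpus = Counter(tuple(entity) for entity in entities)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken deterministically by the pair itself.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        if pair_counts[best] < min_freq:
            break
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(sentence, merges):
    """Turn a sentence into a mixed subword-character sequence."""
    symbols = tuple(sentence)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy list of known entities (hypothetical data for illustration).
known_entities = ["北京大学", "北京市", "南京大学"]
merges = learn_merges(known_entities, num_merges=10)
mixed_sequence = encode("我在北京大学读书", merges)
print(mixed_sequence)
```

Because the merges are learned from entities rather than from a word-segmented corpus, characters outside any learned subword stay as single characters, so the resulting sequence can be passed to a character-level tagger such as an ID-CNN with a CRF layer, with each subword treated as one input position.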
