Chinese New Word Identification: A Latent Discriminative Model with Global Features

New words are particularly problematic in Chinese natural language processing. With the rapid growth of the Internet and the resulting information explosion, it is impossible to maintain a complete system lexicon for Chinese natural language processing applications, because new words absent from dictionaries are constantly being created. New word identification and POS tagging are usually performed as separate steps, so lexical features cannot be fully exploited. We propose a latent discriminative model that combines the strengths of the Latent Dynamic Conditional Random Field (LDCRF) and the semi-CRF to detect new words of any type jointly with their POS tags, directly from Chinese text that has not been pre-segmented. Unlike a standard semi-CRF, the proposed model uses an LDCRF to generate candidate entities, which speeds up training and reduces computational cost. The complexity of the proposed hidden semi-CRF can be further adjusted by tuning the number of hidden variables and the number of candidate entities taken from the N-best outputs of the LDCRF model. A new-word-generating framework is proposed for model training and testing, under which the definitions and distributions of new words conform to those found in real text. A global feature, called “Global Fragment Features”, is adopted for new word identification. We tested our model on the corpus from SIGHAN-6. Experimental results show that the proposed method can detect even low-frequency new words together with their POS tags with satisfactory results, and that it performs competitively with state-of-the-art models.
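The candidate-restriction idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `candidate_spans` and `decode`, the `(start, end, pos_tag)` span representation, and the toy scoring function are all assumptions made for illustration. The sketch also simplifies to a segment-level score with no transitions between segments and no hidden variables. It only shows why limiting a semi-Markov decoder to spans proposed by an upstream N-best tagger shrinks the search space.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical span representation: (start, end_exclusive, pos_tag).
Span = Tuple[int, int, str]


def candidate_spans(nbest: List[List[Span]]) -> Set[Span]:
    """Collect every labeled span that appears in any N-best segmentation."""
    return {span for segmentation in nbest for span in segmentation}


def decode(n: int, spans: Set[Span], score: Callable[[Span], float]) -> List[Span]:
    """Return the best-scoring segmentation of positions 0..n.

    Only spans from the candidate set are considered, so the work is
    proportional to the number of candidates rather than to all
    (start, end, tag) combinations.
    """
    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)          # best[i]: best score covering 0..i
    back: List[Optional[Span]] = [None] * (n + 1)
    best[0] = 0.0

    # Index candidates by end position for the left-to-right pass.
    by_end: Dict[int, List[Span]] = {}
    for span in spans:
        by_end.setdefault(span[1], []).append(span)

    for end in range(1, n + 1):
        for span in by_end.get(end, []):
            start = span[0]
            if best[start] == neg_inf:
                continue                # prefix up to `start` is unreachable
            total = best[start] + score(span)
            if total > best[end]:
                best[end] = total
                back[end] = span

    if best[n] == neg_inf:
        raise ValueError("candidate spans do not cover the whole sequence")

    # Follow back-pointers from the end of the sequence.
    result: List[Span] = []
    pos = n
    while pos > 0:
        span = back[pos]
        assert span is not None
        result.append(span)
        pos = span[0]
    return result[::-1]


if __name__ == "__main__":
    # Two hypothetical N-best segmentations of a 4-character sentence.
    nbest = [
        [(0, 2, "n"), (2, 4, "v")],
        [(0, 1, "n"), (1, 4, "n")],
    ]
    toy_weights = {(0, 2, "n"): 1.0, (2, 4, "v"): 0.5,
                   (0, 1, "n"): 0.2, (1, 4, "n"): 2.0}
    print(decode(4, candidate_spans(nbest), lambda s: toy_weights.get(s, 0.0)))
```

Raising N enlarges the candidate set and therefore the decoding cost, which corresponds to the complexity adjustment mentioned in the abstract.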
