Chinese Unknown Word Identification Based on Local Bigram Model

This paper presents a Chinese unknown word identification system based on a local bigram model. Our word segmentation system normally uses a statistical unigram model. To identify unknown words, however, we exploit their contextual information and apply a bigram model locally. The two models, which differ in order, are combined through an interpolation weight derived from a smoothing method. Because it simplifies a full bigram model, the method is both simple and feasible: its algorithmic complexity is low and it needs relatively little training data. Our experimental results show that the approach is effective.
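
The combination described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the add-one smoothing of the unigram estimate, the fixed weight `lam`, and the `is_candidate` flag marking positions where an unknown word is suspected are all assumptions made for illustration. Ordinary positions are scored by the unigram model alone; only candidate positions mix in the bigram estimate by linear interpolation.

```python
from collections import Counter


def count_ngrams(sentences):
    """Collect unigram and bigram counts from pre-segmented sentences."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        unigram_counts.update(words)
        bigram_counts.update(zip(words, words[1:]))
    return unigram_counts, bigram_counts


def unigram_prob(word, unigram_counts):
    """Add-one smoothed unigram estimate (an assumed smoothing choice)."""
    total = sum(unigram_counts.values())
    vocab = len(unigram_counts) + 1  # one extra slot for unseen words
    return (unigram_counts[word] + 1) / (total + vocab)


def bigram_prob(prev, word, unigram_counts, bigram_counts):
    """Maximum-likelihood bigram estimate; 0.0 if the context is unseen."""
    context = unigram_counts[prev]
    if context == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context


def local_score(prev, word, unigram_counts, bigram_counts, lam, is_candidate):
    """Score a word with the unigram model, adding the bigram term locally.

    lam is the interpolation weight in [0, 1]; is_candidate is a
    hypothetical flag for positions where an unknown word is suspected.
    """
    p_uni = unigram_prob(word, unigram_counts)
    if not is_candidate:
        return p_uni
    p_bi = bigram_prob(prev, word, unigram_counts, bigram_counts)
    return lam * p_bi + (1 - lam) * p_uni


if __name__ == "__main__":
    # Toy corpus, invented purely for this example.
    corpus = [["我", "在", "北京"], ["他", "在", "北京"], ["我", "在", "家"]]
    uni, bi = count_ngrams(corpus)
    print(local_score("在", "北京", uni, bi, lam=0.5, is_candidate=False))
    print(local_score("在", "北京", uni, bi, lam=0.5, is_candidate=True))
```

Setting `lam` to 0 reduces the score to the plain unigram model, and raising it gives context more influence at candidate positions. Because the bigram term is evaluated only at those positions, the cost stays close to that of unigram segmentation.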
