A Chunking Strategy Towards Unknown Word Detection in Chinese Word Segmentation

This paper proposes a chunking strategy for detecting unknown words in Chinese word segmentation. First, a raw sentence is pre-segmented into a sequence of word atoms using a maximum matching algorithm. A chunking model is then applied to detect unknown words by chunking one or more word atoms together according to their word formation patterns. A discriminative Markov model, the Mutual Information Independence Model (MIIM), is adopted for chunking. In addition, a maximum entropy model is applied to integrate various types of contexts and to alleviate the data sparseness problem in MIIM. Moreover, an error-driven learning approach is proposed to select useful contexts for the maximum entropy model, so that the number of contexts can be significantly reduced without any decrease in performance. This makes it feasible to further improve performance by considering a wider variety of context types. Evaluation on the PK and CTB corpora from the First SIGHAN Chinese word segmentation bakeoff shows that our chunking approach detects about 80% of the unknown words in both corpora and outperforms the best-reported systems in unknown word detection by 8.1% and 7.1%, respectively.
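The pre-segmentation step mentioned above can be illustrated with a generic forward maximum matching sketch (the paper itself does not specify the exact variant, so the function name, window size, and dictionary here are illustrative assumptions):

```python
def forward_maximum_matching(sentence, dictionary, max_word_len=4):
    """Greedily segment a sentence into word atoms by matching the
    longest dictionary word starting at each position.

    Hypothetical sketch of the pre-segmentation step; the actual
    system may use a different matching direction or tie-breaking.
    """
    tokens = []
    i = 0
    while i < len(sentence):
        matched = None
        # Try the longest window first, shrinking until a dictionary word is found.
        for j in range(min(len(sentence), i + max_word_len), i, -1):
            if sentence[i:j] in dictionary:
                matched = sentence[i:j]
                break
        if matched is None:
            # No dictionary word starts here: emit a single character.
            matched = sentence[i]
        tokens.append(matched)
        i += len(matched)
    return tokens
```

Out-of-vocabulary characters fall out as single-character atoms, which is exactly what makes a downstream chunking model useful: unknown words surface as runs of short atoms that can be re-joined according to their word formation patterns.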
