Toward a unified approach to lexicon optimization and perplexity minimization for Chinese language modeling

This paper presents a unified approach to lexicon optimization and perplexity minimization for Chinese language modeling (LM). Instead of using a non-iterative segmentation-detection method, the proposed approach iteratively extracts candidate words, selects new words according to a perplexity minimization criterion, and adds the selected words to the lexicon. The augmented lexicon is then used in the next iteration to re-segment the input corpus, and the process repeats until the perplexity of the LM converges. Experiments show that both precision and recall improve and that the perplexity of the LM is reduced by 6.3%.
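The iterative loop described above can be illustrated with a minimal sketch. It is not the paper's implementation: the forward maximum matching segmenter, the maximum-likelihood unigram LM, the per-character perplexity normalization, the candidate generator (concatenations of adjacent tokens), and all function names and parameters (`segment`, `char_perplexity`, `candidates`, `optimize_lexicon`, `min_count`, `tol`, `max_iter`) are assumptions chosen only to keep the example short and self-contained.

```python
import math
from collections import Counter


def segment(text, lexicon):
    """Forward maximum matching (an assumed segmenter, not the paper's).

    Characters not covered by any lexicon word become single-character tokens.
    """
    max_len = max((len(w) for w in lexicon), default=1)
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in lexicon:
                tokens.append(piece)
                i += n
                break
    return tokens


def char_perplexity(sentences, lexicon):
    """Per-character perplexity of a maximum-likelihood unigram LM.

    Normalizing by characters (an assumption) keeps values comparable
    when the lexicon, and hence the token count, changes.
    """
    counts = Counter(t for s in sentences for t in segment(s, lexicon))
    total = sum(counts.values())
    log_prob = sum(c * math.log(c / total) for c in counts.values())
    n_chars = sum(len(s) for s in sentences)
    return math.exp(-log_prob / n_chars)


def candidates(sentences, lexicon, min_count=2):
    """Candidate words: frequent concatenations of adjacent tokens."""
    pairs = Counter()
    for s in sentences:
        toks = segment(s, lexicon)
        pairs.update(a + b for a, b in zip(toks, toks[1:]))
    return [w for w, c in pairs.items() if c >= min_count and w not in lexicon]


def optimize_lexicon(sentences, lexicon, tol=1e-6, max_iter=20):
    """Greedily add the candidate that lowers perplexity most, then re-segment.

    Stops when no candidate improves perplexity by more than `tol`.
    """
    lexicon = set(lexicon)
    best = char_perplexity(sentences, lexicon)
    for _ in range(max_iter):
        scored = [(char_perplexity(sentences, lexicon | {w}), w)
                  for w in candidates(sentences, lexicon)]
        if not scored:
            break
        ppl, word = min(scored)
        if best - ppl <= tol:  # perplexity has converged
            break
        lexicon.add(word)
        best = ppl
    return lexicon, best
```

Calling `optimize_lexicon(corpus_sentences, initial_lexicon)` on a list of unsegmented strings returns the augmented lexicon and the final per-character perplexity. A real system would presumably use a higher-order n-gram LM with smoothing and evaluate perplexity on held-out data; the greedy one-word-per-iteration selection here is only one simple way to realize a perplexity minimization criterion.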
