Lexicon Optimization for Chinese Language Modeling

In this paper, we present an approach to lexicon optimization for Chinese language modeling. The method is an iterative procedure consisting of two phases, namely lexicon generation and lexicon pruning. In the first phase, we extract appropriate new words from a very large training corpus by statistical approaches. In the second phase, we prune the lexicon to a preset memory limitation using a perplexity minimization criterion. Experimental results show up to 6% character perplexity reduction comparing to the baseline lexicon.

[1]  Keh-Yih Su,et al.  Corpus-based Automatic Compound Extraction with Mutual Information and Relative Frequency Count , 1993, ROCLING/IJCLCLP.

[2]  Lee-Feng Chien,et al.  PAT-tree-based adaptive keyphrase extraction for intelligent Chinese information retrieval , 1999, Inf. Process. Manag..

[3]  Lin-Shan Lee,et al.  Statistics-based segment pattern lexicon-a new direction for Chinese language modeling , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[4]  Frederick Jelinek,et al.  Self-organizing language modeling for speech recognition , 1990 .

[5]  J. Cleary,et al.  \self-organized Language Modeling for Speech Recognition". In , 1997 .

[6]  Richard M. Schwartz,et al.  A hidden Markov model information retrieval system , 1999, SIGIR '99.

[7]  Egidio P. Giachin,et al.  Phrase bigrams for continuous speech recognition , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[8]  Ronald Rosenfeld,et al.  Statistical language modeling using the CMU-cambridge toolkit , 1997, EUROSPEECH.

[9]  André Berton,et al.  Compound words in large-vocabulary German speech recognition systems , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[10]  Victor Zue Navigating the Information Superhighway Using Spoken Language Interfaces , 1995, IEEE Expert.

[11]  Andreas Stolcke,et al.  Entropy-based Pruning of Backoff Language Models , 2000, ArXiv.

[12]  Jianfeng Gao,et al.  Extraction of Chinese Compound Words - An Experimental Study on a Very Large Corpus , 2000, ACL 2000.

[13]  Jianfeng Gao,et al.  A unified approach to statistical language modeling for Chinese , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).