Characteristics of Chinese language models for large vocabulary telephone speech

This paper is concerned with language modeling (LM) for large vocabulary speech recognition in Mandarin Chinese. Because Chinese has distinctive language characteristics, we investigate some novel language modeling techniques and also borrow some techniques that have been applied to other languages. Experiments were conducted on the Call Home Mandarin, HUB4, and HUB5 corpora obtained from the Linguistic Data Consortium (LDC). The training set consists of 9.8 hours of spontaneous speech and 100K words of text; the test set consists of 1.6 hours of spontaneous speech and 20K words of text. Our results compare favorably with those reported in the literature.