Language modelling for Russian and English using words and classes

This paper examines statistical language modelling of Russian and English in the context of automatic speech recognition. The characteristics of a Russian and an English text corpus of similar composition are discussed with reference to the properties of the two languages. In particular, it is shown that to achieve the same vocabulary coverage as a 65,000-word vocabulary for English, a 430,000-word vocabulary is required for Russian. The implications of this observation motivate the remainder of the paper. Perplexity experiments are reported for word-based N-gram modelling of the two languages and the differences are examined. It is found that, in contrast to English, there is little gain in using 4-grams over trigrams for modelling Russian. Class-based N-gram modelling is then considered, and perplexity experiments are reported for two types of class model: a two-sided model and a novel one-sided model for which classes are generated automatically. When combined with the word model, the two-sided class model yields lower perplexities than the one-sided model. However, the very large Russian vocabulary favours the one-sided model, since the clustering algorithm used to obtain its word classes automatically is significantly faster. Lattice rescoring experiments on an English-language broadcast news task show that combining the word model with either type of class model produces identical reductions in word error rate.
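For concreteness, the standard forms of the two class models can be sketched as follows. This is an illustrative summary in common notation, not the paper's exact parameterisation: here c_i denotes the automatically derived class of word w_i, N is the N-gram order, and h_i is the word history. In the two-sided model both the predicted word and its history are mapped to classes, whereas the one-sided model predicts the word directly from the classes of the history words:

  P(w_i \mid w_{i-1}, \dots, w_{i-N+1}) \approx P(w_i \mid c_i) \, P(c_i \mid c_{i-1}, \dots, c_{i-N+1})   % two-sided
  P(w_i \mid w_{i-1}, \dots, w_{i-N+1}) \approx P(w_i \mid c_{i-1}, \dots, c_{i-N+1})                      % one-sided

Assuming the word and class models are combined by linear interpolation, one common choice (the abstract does not specify the combination scheme), with weight \lambda and perplexity measured over M test words:

  P(w_i \mid h_i) = \lambda \, P_{\mathrm{word}}(w_i \mid h_i) + (1 - \lambda) \, P_{\mathrm{class}}(w_i \mid h_i)
  \mathrm{PP} = \exp\!\Big( -\tfrac{1}{M} \sum_{i=1}^{M} \ln P(w_i \mid h_i) \Big)

Note that the one-sided form classes only the history words and has no P(w_i \mid c_i) factor, which is consistent with the abstract's observation that its clustering algorithm is significantly faster for a very large vocabulary.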
