Multilingual stochastic n-gram class language models

Stochastic language models are widely used in continuous speech recognition systems where a priori probabilities of word sequences are needed. These probabilities are usually given by n-gram word models, estimated on very large training texts. When n increases, it becomes harder to find reliable statistics, even with huge texts. Grouping words is a way to overcome this problem. We have developed an automatic language independent classification procedure, which is able to optimize the classification of tens of millions of untagged words in less than a few hours on a Unix workstation. With this language independent approach, three corpora each containing about 30 million words of newspaper texts, in French, German and English, have been mapped into different numbers of classes. From these classifications, bi-gram and tri-gram class language models have been built. The perplexities of held-out test texts have been assessed, showing that tri-gram class models give lower values than those obtained with tri-gram word models, for the three languages.

[1]  Slava M. Katz,et al.  Estimation of probabilities from sparse data for the language model component of a speech recognizer , 1987, IEEE Trans. Acoust. Speech Signal Process..

[2]  Pascale Fung,et al.  The estimation of powerful language models from small and large corpora , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[3]  Hermann Ney,et al.  On structuring probabilistic dependences in stochastic language modelling , 1994, Comput. Speech Lang..

[4]  Gilles Adda,et al.  Automatic Determination of a Stochastic Bi-Gram Class Language Model , 1994, ICGI.

[5]  Robert L. Mercer,et al.  Class-Based n-gram Models of Natural Language , 1992, CL.

[6]  Naftali Tishby,et al.  Distributional Clustering of English Words , 1993, ACL.

[7]  Christopher C. Huckle Grouping Words Using Statistical Context , 1995, EACL.

[8]  Gilles Adda,et al.  Automatic word classification using simulated annealing , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.