An Efficiently Focusing Large Vocabulary Language Model

Accurate statistical language models are needed, for example, for large vocabulary speech recognition. Constructing models that are both computationally efficient and able to utilize long-term dependencies in the data is a challenging task. In this article we describe how a topical clustering obtained by ordered maps of document collections can be utilized to construct efficiently focusing statistical language models. Experiments on Finnish and English texts demonstrate considerable improvements in perplexity compared to a general n-gram model and to manually classified topic categories. In the speech recognition task, the recognition history and the current hypothesis can be utilized to focus the model towards the current discourse or topic, and the focused model can then be applied to re-rank the hypothesis.
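The focusing idea can be illustrated with a minimal sketch. It is not the authors' implementation: the two hand-written toy clusters stand in for the topical clusters that the article obtains from ordered document maps, add-one smoothed unigram models stand in for n-gram models, and the function names and the interpolation weight `lam` are assumptions made for illustration only. The sketch picks the cluster model that best explains the recognition history, interpolates it with the general model, and compares perplexity on the following words.

```python
import math
from collections import Counter

def unigram_model(tokens, vocab, alpha=1.0):
    """Additively smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def log_likelihood(model, tokens):
    """Sum of log probabilities; every token must be in the model vocabulary."""
    return sum(math.log(model[w]) for w in tokens)

def focus(cluster_models, history):
    """Index of the cluster model that assigns the history the highest likelihood."""
    return max(range(len(cluster_models)),
               key=lambda i: log_likelihood(cluster_models[i], history))

def interpolate(focused, general, lam=0.5):
    """Linear interpolation of the focused model with the general model."""
    return {w: lam * focused[w] + (1 - lam) * general[w] for w in general}

def perplexity(model, tokens):
    """exp of the negative mean log probability per token."""
    return math.exp(-log_likelihood(model, tokens) / len(tokens))

# Hypothetical toy clusters (assumption: stand-ins for map-derived topic clusters).
clusters = [
    "the match ended with a late goal and the team won".split(),
    "the bank raised the interest rate and the market fell".split(),
]
vocab = sorted({w for c in clusters for w in c})
general = unigram_model([w for c in clusters for w in c], vocab)
cluster_models = [unigram_model(c, vocab) for c in clusters]

history = "the team".split()         # words recognized so far
upcoming = "goal won match".split()  # words to be scored

best = focus(cluster_models, history)
focused = interpolate(cluster_models[best], general)

print("selected cluster:", best)
print("general perplexity:", perplexity(general, upcoming))
print("focused perplexity:", perplexity(focused, upcoming))
```

On this toy data the history selects the sports cluster, and the focused model assigns the upcoming words a lower perplexity than the general model. In a recognizer, the same focused scores could be used to re-rank competing hypotheses.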
