On Compressing N-Gram Language Models

In large-vocabulary speech recognition systems, most of the memory is typically consumed by a large n-gram language model. Representing the language model compactly is therefore important in recognition systems targeted at small devices with limited memory. This paper extends the compressed language model structure proposed earlier by Whittaker and Raj. By storing separately the n-grams that are prefixes of longer n-grams, redundant information can be omitted. Experiments on English 4-gram models and Finnish 6-gram models show that the extended structure achieves lossless memory reductions of up to 30% compared to the baseline structure of Whittaker and Raj.
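To make the idea concrete, the sketch below (not the authors' implementation; the function name and toy storage layout are illustrative assumptions) separates the n-grams that are prefixes of longer stored n-grams, and hence need a child pointer in an array-based trie layout, from those that are not, for which the pointer field can be omitted entirely.

```python
# Minimal sketch of splitting n-grams by whether they are prefixes of
# longer stored n-grams. Only the prefix n-grams need a child pointer
# in an array-based trie layout; the rest can drop that field.

def split_prefix_ngrams(ngrams):
    """Return (n-grams that are prefixes of longer n-grams, the rest)."""
    ngram_set = set(ngrams)
    # An n-gram is a "prefix" if dropping the last word of some longer
    # stored n-gram yields it.
    prefixes = {ng[:-1] for ng in ngram_set if len(ng) > 1}
    with_children = [ng for ng in ngram_set if ng in prefixes]
    without_children = [ng for ng in ngram_set if ng not in prefixes]
    return with_children, without_children

# Toy model: three unigrams, two bigrams, one trigram.
ngrams = [
    ("the",), ("cat",), ("sat",),
    ("the", "cat"), ("cat", "sat"),
    ("the", "cat", "sat"),
]
need_pointer, no_pointer = split_prefix_ngrams(ngrams)
print(f"{len(need_pointer)} n-grams need a child pointer, "
      f"{len(no_pointer)} do not")
```

In typical models the non-prefix n-grams are the large majority (every highest-order n-gram is one), which is why omitting their pointer fields can yield the sizable lossless reductions reported in the abstract.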

[1] Bhiksha Raj, et al. Quantization-based language model compression, 2001, INTERSPEECH.

[2] Bhiksha Raj, et al. Lossless compression of language model structure and word identifiers, 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03).

[3] Jesper Ø. Olsen, et al. Profile Based Compression of N-Gram Language Models, 2006, 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4] Andreas Stolcke, et al. Entropy-based Pruning of Backoff Language Models, 2000, ArXiv.

[5] Slava M. Katz, et al. Estimation of probabilities from sparse data for the language model component of a speech recognizer, 1987, IEEE Trans. Acoust. Speech Signal Process.

[6] Bhiksha Raj, et al. Comparison of width-wise and length-wise language model compression, 2001, INTERSPEECH.

[7] Andreas Stolcke, et al. SRILM - an extensible language modeling toolkit, 2002, INTERSPEECH.

[8] Mathias Creutz, et al. Unsupervised Morpheme Segmentation and Morphology Induction from Text Corpora Using Morfessor 1.0, 2005.

[9] Vesa Siivola, et al. Growing an n-gram language model, 2005, INTERSPEECH.

[10] Mikko Kurimo, et al. Unlimited vocabulary speech recognition with morph language models applied to Finnish, 2006, Comput. Speech Lang.

[11] Penn Treebank, Linguistic Data Consortium, 1999.

[12] Jianfeng Gao, et al. Language model size reduction by pruning and clustering, 2000, INTERSPEECH.

[13] Teemu Hirsimäki, et al. On Growing and Pruning Kneser-Ney Smoothed N-Gram Models, 2007, IEEE Transactions on Audio, Speech, and Language Processing.