Bag-of-word normalized n-gram models

The bag-of-words (BOW) model represents text as a fixed-length vector of word counts. Although it discards word-order information, it has proven effective at capturing long-range word-word correlations and topic information. In contrast, n-gram models capture short-term dependencies by modeling text as a Markovian sequence. In this paper, we propose a probabilistic framework for combining BOW models with n-gram models. In this framework, we normalize the n-gram model to obtain a model of word sequences given their bag-of-words representation. By combining the two models, the proposed approach captures both latent topic information and local Markovian dependencies in text. Using the proposed model, we achieved a 10% reduction in perplexity and a 2% relative reduction in WER over a state-of-the-art baseline for transcribing English broadcast news.
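
To make the combination concrete, the sketch below illustrates one reading of the idea on a toy example: a sequence probability is factored as P(W) = P_BOW(bag(W)) · P(W | bag(W)), where the second factor is an n-gram score renormalized over all orderings of the same bag. This is only a minimal illustration under assumed numbers; the bigram table, the topic unigrams, and the brute-force enumeration over permutations are all hypothetical stand-ins (the paper's actual normalization and estimation procedure may differ and would not enumerate permutations for realistic sentence lengths).

```python
# Toy sketch of a bag-of-words (BOW) normalized bigram model.
# All probabilities below are invented for illustration only.
from itertools import permutations
from collections import Counter

# Hypothetical bigram probabilities P(w_i | w_{i-1}); "<s>" marks sentence start.
BIGRAM = {
    ("<s>", "the"): 0.4, ("<s>", "stocks"): 0.1,
    ("the", "stocks"): 0.2, ("the", "fell"): 0.05,
    ("stocks", "fell"): 0.3, ("stocks", "the"): 0.02,
    ("fell", "the"): 0.1, ("fell", "stocks"): 0.05,
}
BACKOFF = 1e-4  # floor probability for unseen bigrams


def ngram_prob(words):
    """Bigram probability of a word sequence."""
    p, prev = 1.0, "<s>"
    for w in words:
        p *= BIGRAM.get((prev, w), BACKOFF)
        prev = w
    return p


def seq_given_bag(words):
    """P(sequence | bag): the bigram score renormalized over all orderings
    of the same bag of words (tractable only for tiny bags)."""
    z = sum(ngram_prob(perm) for perm in set(permutations(words)))
    return ngram_prob(words) / z


def bow_prob(words, topic_unigrams):
    """Toy stand-in for a BOW/topic model: product of per-word topic
    probabilities for the bag (order-independent by construction)."""
    p = 1.0
    for w, count in Counter(words).items():
        p *= topic_unigrams.get(w, BACKOFF) ** count
    return p


if __name__ == "__main__":
    # Hypothetical topic unigrams, e.g. as inferred by a PLSA/LDA-style model.
    topic_unigrams = {"the": 0.5, "stocks": 0.3, "fell": 0.2}
    sentence = ["the", "stocks", "fell"]
    p_combined = bow_prob(sentence, topic_unigrams) * seq_given_bag(sentence)
    print(f"P_BOW(bag) * P(sequence | bag) = {p_combined:.3e}")
```

The point of the normalization is that the n-gram factor only decides how likely a particular ordering is among sequences with the same word counts, while the BOW factor decides how likely the counts themselves are, so topic information and local Markovian structure enter through separate, non-overlapping terms.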
