Effective pseudo-relevance feedback for language modeling in speech recognition

Language modeling (LM) is an integral part of any automatic speech recognition (ASR) system: it constrains the acoustic analysis, guides the search through multiple candidate word strings, and quantifies the acceptability of the final output hypothesis for a given input utterance. Although the n-gram model remains predominant, a number of novel and ingenious LM methods have been developed to complement it or to be used in its place. A more recent line of research leverages information cues gleaned from pseudo-relevance feedback (PRF) to derive an utterance-regularized language model that complements the n-gram model. This paper continues that line of research, and its main contribution is two-fold. First, we explore an alternative and more efficient formulation for constructing such an utterance-regularized language model for ASR. Second, we analyze and extensively compare the utilities of various utterance-regularized language models. Empirical experiments on a large vocabulary continuous speech recognition (LVCSR) task demonstrate that the proposed language models offer substantial improvements over the baseline n-gram system and achieve performance competitive with, or better than, several state-of-the-art language models.
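The general PRF-for-LM idea summarized above can be sketched as follows. This is a minimal illustration, not the authors' formulation: the first-pass ASR hypothesis is treated as a query, top-ranked "pseudo-relevant" documents are used to estimate a unigram relevance model, and that model is linearly interpolated with the baseline n-gram probability. All function names, the Dirichlet-smoothing retrieval score, and the weights are illustrative assumptions.

```python
# Hedged sketch of pseudo-relevance feedback (PRF) for language modeling.
# Not the paper's exact method; names, smoothing, and weights are assumed.
from collections import Counter
import math

def ql_score(query_tokens, doc_tokens, collection_tf, mu=100.0):
    """Query-likelihood retrieval score with Dirichlet smoothing."""
    doc_tf = Counter(doc_tokens)
    dlen = len(doc_tokens)
    clen = max(sum(collection_tf.values()), 1)
    s = 0.0
    for w in query_tokens:
        p = (doc_tf[w] + mu * collection_tf[w] / clen) / (dlen + mu)
        s += math.log(max(p, 1e-12))
    return s

def prf_unigram(hypothesis, documents, top_k=2):
    """Estimate a unigram relevance model from the top-k pseudo-relevant
    documents retrieved with the first-pass hypothesis as the query."""
    collection_tf = Counter(w for d in documents for w in d)
    ranked = sorted(documents,
                    key=lambda d: ql_score(hypothesis, d, collection_tf),
                    reverse=True)
    feedback_tf = Counter(w for d in ranked[:top_k] for w in d)
    total = sum(feedback_tf.values())
    return {w: c / total for w, c in feedback_tf.items()}

def interpolate(p_ngram, p_prf, lam=0.5):
    """Linearly combine the baseline n-gram probability with the
    PRF-derived unigram probability for rescoring."""
    return (1 - lam) * p_ngram + lam * p_prf
```

In a full system, the interpolated probabilities would be used to rescore N-best lists or lattices from the first decoding pass; the interpolation weight `lam` would typically be tuned on held-out data.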
