Use of contexts in language model interpolation and adaptation

Language models (LMs) are often constructed by building multiple component models and combining them with context independent interpolation weights. Tuning these weights, using either perplexity or discriminative criteria, adapts the LM to a particular task. This paper investigates context dependent weighting for both interpolation and test-time adaptation of language models. A discrete history weighting function, conditioned on the preceding word context, adjusts the contribution of each component model. Because this dramatically increases the number of parameters to estimate, robust weight estimation schemes are required, and several are described. The first is based on MAP estimation, in which the interpolation weights of lower order contexts serve as smoothing priors. The second uses training data to ensure robust estimation of the LM interpolation weights; these weights can also serve as a smoothing prior for MAP adaptation. A normalized perplexity metric is proposed to handle the bias of the standard perplexity criterion towards corpus size. A range of schemes that combine weight information obtained from training data and from test data hypotheses is proposed to improve robustness during context dependent LM adaptation. A minimum Bayes risk (MBR) based discriminative training scheme is also proposed, and an efficient weighted finite state transducer (WFST) decoding algorithm for context dependent interpolation is presented. The proposed techniques were evaluated on a state-of-the-art Mandarin Chinese broadcast speech transcription task. Character error rate (CER) reductions of up to 7.3% relative were obtained, together with consistent perplexity improvements.
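As an illustration of the idea of context dependent interpolation, the sketch below computes P(w|h) as a weighted sum of component probabilities, with weights looked up by word history and falling back to shorter histories when a context has no weights of its own. It also shows one common form of MAP smoothing, in which the weights of a lower order context act as a prior. The abstract does not give the paper's equations, so the function names, the back-off lookup, the count-based MAP formula, and the prior strength `tau` are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

History = Tuple[str, ...]
Component = Callable[[str, History], float]


def lookup_weights(history: History,
                   weights: Dict[History, List[float]]) -> List[float]:
    """Return weights for the longest stored suffix of the history.

    The empty tuple () is assumed to hold the context independent weights.
    """
    if () not in weights:
        raise KeyError("context independent weights () must be provided")
    h = history
    while h not in weights:
        h = h[1:]  # drop the oldest word and try a shorter context
    return weights[h]


def interpolate(word: str, history: History,
                components: Sequence[Component],
                weights: Dict[History, List[float]]) -> float:
    """P(w|h) = sum_m lambda_m(h) * P_m(w|h)."""
    lam = lookup_weights(history, weights)
    return sum(l * p(word, history) for l, p in zip(lam, components))


def map_weights(counts: Sequence[float], prior: Sequence[float],
                tau: float) -> List[float]:
    """Hypothetical MAP update: soft counts smoothed towards prior weights.

    counts: per-component posterior counts collected for one context.
    prior:  weights of the lower order context, used as the prior.
    tau:    prior strength (assumed hyperparameter).
    """
    total = sum(counts) + tau
    return [(c + tau * q) / total for c, q in zip(counts, prior)]


# Toy components that ignore their inputs (illustration only).
components = [lambda w, h: 0.2, lambda w, h: 0.6]
weights = {(): [0.5, 0.5], ("b",): [0.9, 0.1]}

p_seen = interpolate("x", ("a", "b"), components, weights)    # uses ("b",)
p_unseen = interpolate("x", ("a", "c"), components, weights)  # falls back to ()
smoothed = map_weights([3.0, 1.0], [0.5, 0.5], tau=4.0)
```

With few counts in a context, `map_weights` stays close to the lower order prior; as counts grow, the context specific estimate dominates, which is the intuition behind using lower order weights as smoothing priors.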
