Language model cross adaptation for LVCSR system combination

State-of-the-art large vocabulary continuous speech recognition (LVCSR) systems often combine the outputs of multiple sub-systems, which may even be developed at different sites. Cross adaptation, in which a sub-system's models are adapted using the outputs of another sub-system, can be used as an alternative to hypothesis-level combination schemes such as ROVER. Normally, cross adaptation is performed only on the acoustic models. However, complementary features may also be exploited at many other levels of an LVCSR system's modelling hierarchy, for example the sub-word and word levels, to further improve system combination based on cross adaptation. It is thus interesting to cross adapt language models (LMs) as well, in order to capture these additional useful features. In this paper, cross adaptation is applied to three forms of language model: a multi-level LM that models both syllable and word sequences, a word-level neural network LM, and the linear combination of the two. Significant error rate reductions of 4.0-7.1% relative were obtained over ROVER and over cross adaptation of the acoustic models only when combining a range of Chinese LVCSR sub-systems used in the 2010 and 2011 DARPA GALE evaluations.
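As a rough illustration of the "linear combination of the two" LMs, the sketch below interpolates the per-word probabilities of two language models and estimates the interpolation weight with EM on adaptation text. It is only a minimal sketch under stated assumptions, not the method used in the paper: the function names, the single global weight, and the example probabilities are all hypothetical, and in a cross adaptation setting the adaptation text would be the hypotheses produced by another sub-system.

```python
import math

def interpolate(p_a, p_b, lam):
    """Linear combination of two LM probabilities for the same word and history."""
    return lam * p_a + (1.0 - lam) * p_b

def em_weight(probs_a, probs_b, lam=0.5, iters=20):
    """Estimate one global interpolation weight by EM.

    probs_a / probs_b hold the probability each LM assigns to every word of
    the adaptation text (hypothetical values here; all must be > 0).
    """
    for _ in range(iters):
        # E-step: posterior that each word was generated by LM A.
        post = [lam * a / interpolate(a, b, lam) for a, b in zip(probs_a, probs_b)]
        # M-step: the new weight is the average posterior.
        lam = sum(post) / len(post)
    return lam

def perplexity(probs):
    """Perplexity of a word sequence given its per-word probabilities."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Hypothetical per-word probabilities from two LMs on the same adaptation text.
probs_a = [0.2, 0.1, 0.3, 0.05]
probs_b = [0.1, 0.2, 0.1, 0.02]

lam = em_weight(probs_a, probs_b)
ppl_uniform = perplexity([interpolate(a, b, 0.5) for a, b in zip(probs_a, probs_b)])
ppl_adapted = perplexity([interpolate(a, b, lam) for a, b in zip(probs_a, probs_b)])
```

Because each EM iteration cannot decrease the likelihood of the adaptation text, the adapted weight gives a perplexity on that text no higher than the uniform starting weight. Whether that carries over to lower error rates depends on the quality of the adaptation supervision.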
