Acoustic modelling for speech recognition: Hidden Markov models and beyond?

Hidden Markov models (HMMs) are still the dominant form of acoustic model used in automatic speech recognition (ASR) systems. However over the years the form, and training, of the HMM for ASR have been extended and modified, so that the current forms used in state-of-the-art speech recognition systems are very different to those originally proposed thirty years ago. This talk will review two of the more important extensions that have been proposed over the years: discriminative training; and speaker and environment adaptation. The use of discriminative training is now common with forms based on minimum Bayes' training and minimum classification error being applied to systems trained on many hundreds of hours of speech data. The talk will describe these current approaches, as well as discussing the current trends towards schemes based on large-margin training approaches. Linear transform based speaker adaptation is the dominant form for speaker adaptation. Current approaches, including extensions to linear transforms and model-based noise robustness techniques, and trends will also be described. Details of the various forms of the adaptation/noise transformation, training criterion and approaches for adaptive training will be given. The final part of the talk will discuss research beyond the current HMM framework. Schemes based on both discriminative models and functions, as well as non-parametric approaches will be described.

[1]  Richard M. Schwartz,et al.  A compact model for speaker-adaptive training , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[2]  Zdravko Kacic,et al.  A novel loss function for the overall risk criterion based discriminative training of HMM models , 2000, INTERSPEECH.

[3]  George Saon,et al.  Penalty function maximization for large margin HMM training , 2008, INTERSPEECH.

[4]  Jinyu Li,et al.  Approximate Test Risk Minimization Through Soft Margin Estimation , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[5]  Mark J. F. Gales,et al.  Progress in the CU-HTK broadcast news transcription system , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[6]  Lawrence K. Saul,et al.  Large Margin Gaussian Mixture Modeling for Phonetic Classification and Recognition , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[7]  Philip C. Woodland,et al.  Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models , 1995, Comput. Speech Lang..

[8]  Mark J. F. Gales,et al.  Adaptive Training with Joint Uncertainty Decoding for Robust Recognition of Noisy Data , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[9]  Mark J. F. Gales,et al.  Incremental predictive and adaptive noise compensation , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[10]  Yuqing Gao,et al.  Maximum entropy direct models for speech recognition , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[11]  William J. Byrne Minimum Bayes Risk Estimation and Decoding in Large Vocabulary Continuous Speech Recognition , 2006, IEICE Trans. Inf. Syst..

[12]  David Haussler,et al.  Exploiting Generative Models in Discriminative Classifiers , 1998, NIPS.

[13]  Pedro J. Moreno,et al.  Speech recognition in noisy environments , 1996 .

[14]  Alejandro Acero,et al.  Acoustical and environmental robustness in automatic speech recognition , 1991 .

[15]  Biing-Hwang Juang,et al.  Discriminative learning for minimum error classification [pattern recognition] , 1992, IEEE Trans. Signal Process..

[16]  Alex Acero,et al.  Hidden conditional random fields for phone classification , 2005, INTERSPEECH.

[17]  Shantanu Chakrabartty,et al.  Support vector machines for segmental minimum Bayes risk decoding of continuous speech , 2003, 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721).

[18]  Dimitri Kanevsky,et al.  An inequality for rational functions with applications to some statistical estimation problems , 1991, IEEE Trans. Inf. Theory.

[19]  Alex Acero,et al.  Noise adaptive training using a vector taylor series approach for noise robust automatic speech recognition , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[20]  Georg Heigold,et al.  Modified MMI/MPE: a direct evaluation of the margin in speech recognition , 2008, ICML '08.

[21]  S. Katagiri,et al.  Discriminative Learning for Minimum Error Classification , 2009 .

[22]  Mark J. F. Gales,et al.  Bayesian Adaptive Inference and Adaptive Training , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[23]  Yu Hu,et al.  Irrelevant variability normalization based HMM training using VTS approximation of an explicit model of environmental distortions , 2007, INTERSPEECH.

[24]  Mark J. F. Gales,et al.  Speech Recognition using SVMs , 2001, NIPS.

[25]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[26]  Jonathan Le Roux,et al.  Discriminative Training for Large-Vocabulary Speech Recognition Using Minimum Classification Error , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[27]  Roger K. Moore,et al.  Hidden Markov model decomposition of speech and noise , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[28]  Georg Heigold,et al.  On the equivalence of Gaussian HMM and Gaussian HMM-like hidden conditional random fields , 2007, INTERSPEECH.

[29]  Daniel Povey,et al.  Large scale discriminative training of hidden Markov models for speech recognition , 2002, Comput. Speech Lang..

[30]  Daniel Povey,et al.  Minimum Phone Error and I-smoothing for improved discriminative training , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[31]  Li Deng,et al.  HMM adaptation using vector taylor series for noisy speech recognition , 2000, INTERSPEECH.