On quantifying the quality of acoustic models in hybrid DNN-HMM ASR

Abstract

We propose an information theoretic framework for quantitative assessment of acoustic models used in hidden Markov model (HMM) based automatic speech recognition (ASR). The HMM backend expects that (i) the acoustic model yields accurate state conditional emission probabilities for the observations at each time step, and (ii) the conditional probability distribution of the data given the underlying hidden state is independent of any other state in the sequence. The latter property is also known as the Markovian conditional independence assumption of HMM based modeling. In this work, we cast HMM based ASR as a communication channel in which the state emission probabilities computed by the acoustic model form the channel input and the most probable hidden state sequence forms the channel output. The quality of the acoustic model is thus quantified in terms of the amount of information transmitted through this channel, as well as how robust the channel is against the mismatch between the data and the HMM's conditional independence assumption. To formulate the required information theoretic terms, we utilize the gamma posterior (or state occupancy) probabilities of HMM hidden states to derive a simple analysis framework which assesses the benefits and shortcomings of various acoustic models in HMM based ASR. Our approach enables us to analyse acoustic modeling with Gaussian mixture models (GMM) as well as deep neural networks (DNN) (with different numbers of hidden layers) without explicitly evaluating their ASR performance. As use cases, we apply our analysis to sequence discriminatively trained DNN acoustic models as well as state-of-the-art recurrent and time-delay neural networks to compare their efficacy as acoustic models in HMM based ASR. In addition, we use our analysis to study the contribution of sparse and low-dimensional models in enhancing acoustic modeling for better compliance with the HMM requirements.
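The gamma (state-occupancy) posteriors that the analysis builds on are the standard forward–backward quantities gamma_t(i) = P(q_t = i | observations). As a minimal toy sketch (not the paper's implementation), the following computes them for a hypothetical 2-state discrete HMM, together with one simple information-theoretic summary of the posteriors (their average per-frame entropy in bits); the HMM parameters and the entropy summary are illustrative assumptions, not the paper's exact measures.

```python
import math

def forward_backward(obs, pi, A, B):
    """Return gamma[t][i] = P(q_t = i | obs) for a discrete-emission HMM.

    pi: initial state probabilities; A[i][j]: transition probabilities;
    B[i][k]: emission probability of symbol k in state i.
    """
    n, T = len(pi), len(obs)
    # Forward pass: alpha[t][i] = P(obs[0..t], q_t = i)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t - 1][j] * A[j][i] for j in range(n))
                      * B[i][obs[t]] for i in range(n)])
    # Backward pass: beta[t][i] = P(obs[t+1..T-1] | q_t = i)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n))
    # Combine and normalize: gamma[t][i] is proportional to alpha * beta
    gamma = []
    for t in range(T):
        row = [alpha[t][i] * beta[t][i] for i in range(n)]
        z = sum(row)
        gamma.append([r / z for r in row])
    return gamma

def mean_gamma_entropy(gamma):
    """Average per-frame entropy of the state posterior, in bits."""
    return sum(-sum(p * math.log2(p) for p in row if p > 0)
               for row in gamma) / len(gamma)

# Hypothetical 2-state HMM with binary observations
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]  # B[i][k] = P(obs = k | state = i)
gamma = forward_backward([0, 1, 0], pi, A, B)
for t, row in enumerate(gamma):
    print(t, [round(p, 3) for p in row])  # each row sums to 1
print("mean entropy (bits):", round(mean_gamma_entropy(gamma), 3))
```

Sharply peaked gamma posteriors (low entropy) indicate an acoustic model that commits confidently to hidden states, which is one intuition behind using these posteriors as the basis for the channel-style analysis.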
