Tied and Regularized Conditional Gaussian Graphical Models for Acoustic Modeling in ASR

Most automatic speech recognition (ASR) systems express probability densities over sequences of acoustic feature vectors using Gaussian or Gaussian-mixture hidden Markov models. In this chapter, we explore how graphical models can help describe a variety of tied (i.e., parameter-shared) and regularized Gaussian mixture systems. Unlike many previous tied systems, however, here we allow sub-portions of the Gaussians to be tied in arbitrary ways. The space of such models includes regularized, tied, and adaptive versions of mixture conditional Gaussian models, as well as a regularized version of maximum-likelihood linear regression (MLLR). We derive expectation-maximization (EM) update equations and explore the consequences for the training algorithm under relevant variants of those equations. In particular, we find that for certain combinations of regularization and/or tying, the EM update equations no longer admit a closed-form analytic solution. We describe, however, a generalized EM (GEM) procedure that still increases the likelihood and has the same fixed points as the standard EM algorithm.
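
For orientation, the textbook distinction between an EM step and a GEM step can be written as follows. This is a generic sketch, not the chapter's own derivation: the symbols $\theta$ (all model parameters), $x$ (observed features), $z$ (hidden variables), and $Q$ (the EM auxiliary function) are notation assumed here for illustration.

```latex
% Auxiliary function at iteration t (assumed notation):
Q(\theta, \theta^{(t)}) = \mathbb{E}_{z \sim p(z \mid x, \theta^{(t)})}\left[ \log p(x, z \mid \theta) \right]

% Standard EM: the M-step maximizes Q exactly (requires a tractable maximizer)
\theta^{(t+1)} = \arg\max_{\theta} \; Q(\theta, \theta^{(t)})

% Generalized EM: the M-step only needs to not decrease Q
Q(\theta^{(t+1)}, \theta^{(t)}) \;\ge\; Q(\theta^{(t)}, \theta^{(t)})

% Either condition is enough to guarantee a non-decreasing likelihood
\log p(x \mid \theta^{(t+1)}) \;\ge\; \log p(x \mid \theta^{(t)})
```

When no closed-form maximizer exists, a GEM step can be realized by any partial update that improves $Q$, for example a few gradient or coordinate-ascent steps; which particular update the chapter uses is not stated in this abstract.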
