Speech Recognition Using Augmented Conditional Random Fields

State-of-the-art stochastic speech recognition systems rely on acoustic modeling based on hidden Markov models (HMMs). Although HMMs are a natural choice for warping the time axis and modeling the temporal phenomena in the speech signal, their conditional independence properties limit their ability to model spectral phenomena well. In this paper, a new acoustic modeling paradigm based on augmented conditional random fields (ACRFs) is investigated and developed. This paradigm addresses some limitations of HMMs while retaining many of the aspects that have made them successful. In particular, the acoustic modeling problem is reformulated in a data-driven, sparse, augmented space to increase discrimination. Acoustic context modeling is explicitly integrated to handle the sequential phenomena of the speech signal. We present an efficient framework for estimating these models that ensures scalability and generality. On the TIMIT phone recognition task, a phone error rate of 23.0% was recorded on the full test set, a significant improvement over comparable HMM-based systems.
