Structured Discriminative Models For Speech Recognition: An Overview

Automatic speech recognition (ASR) is a structured sequence classification task: a label sequence (the sentence) must be inferred from an observation sequence (the acoustic waveform). This sequential nature is one of the reasons why generative classifiers, which combine hidden Markov model (HMM) acoustic models with N-gram language models via Bayes' rule, have become the dominant technology in ASR. In contrast, machine learning and natural language processing (NLP) research is increasingly dominated by discriminative approaches, in which the class posteriors are modeled directly. This article reviews recent work on structured discriminative models for ASR. To handle continuous, variable-length observation sequences, the approaches developed for NLP tasks must be modified. We discuss a range of structured discriminative models for ASR, drawn both from the current literature and from possible future directions, concentrating on the models themselves, the features commonly used to describe the observations within these models, and the options for optimizing the model parameters.
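
To make the contrast concrete, the two paradigms can be summarized by their decision rules (a minimal sketch in generic notation, not necessarily that of the article: w denotes a word sequence, O the observation sequence, \phi(O, w) a joint feature function, and \alpha the discriminative model parameters). The generative HMM/N-gram system applies Bayes' rule,

\hat{w} = \arg\max_{w} \; p(O \mid w)\, P(w),

whereas a structured discriminative (log-linear, CRF-style) model parameterizes the posterior directly,

P(w \mid O; \alpha) = \frac{\exp\!\left(\alpha^{\mathsf{T}} \phi(O, w)\right)}{\sum_{\tilde{w}} \exp\!\left(\alpha^{\mathsf{T}} \phi(O, \tilde{w})\right)}, \qquad \hat{w} = \arg\max_{w} \; \alpha^{\mathsf{T}} \phi(O, w).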
