Structured SVMs for Automatic Speech Recognition

Structured discriminative models are a flexible approach to sequence classification that enables a wide variety of features to be used. This paper describes a particular model in this framework, the structured support vector machine (SSVM), and how it can be applied to medium to large vocabulary speech recognition tasks. An important aspect of SSVMs is the form of the joint feature space. Here, context-dependent generative models, specifically hidden Markov models, are used to obtain the features. To apply this form of combined generative and discriminative model to medium and larger vocabulary tasks, a number of issues need to be addressed. First, the features extracted are a function of the segmentation of the utterance. A Viterbi-like scheme for obtaining the “optimal” segmentation is described. Second, SSVMs can be viewed as large margin log-linear models with a zero-mean Gaussian prior on the discriminative parameters. However, this form of prior is not appropriate for all features. A modified training algorithm is proposed that allows general Gaussian priors to be incorporated into the large margin criterion. Finally, to speed up the training process, a 1-slack algorithm, caching of competing hypotheses, and parallelization strategies are also described. The performance of SSVMs is evaluated on a small vocabulary task and a medium to large vocabulary task: AURORA 2 and AURORA 4, respectively.
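To make the segmentation step concrete, the following is a minimal sketch of a Viterbi-like dynamic program over segment boundaries. It is not the paper's implementation: the function `best_segmentation` and the callback `seg_score` are hypothetical names, and `seg_score(i, s, t)` stands in for whatever score the model assigns to the i-th label spanning frames s to t (for example, a weighted sum of HMM-derived features). The sketch assumes the label sequence is fixed and that each label covers at least one frame.

```python
from typing import Callable, List, Tuple


def best_segmentation(
    num_frames: int,
    num_labels: int,
    seg_score: Callable[[int, int, int], float],
) -> Tuple[float, List[Tuple[int, int]]]:
    """Split frames [0, num_frames) into num_labels contiguous, non-empty
    segments (one per label, in order) that maximise the summed score.

    seg_score(i, s, t) is a hypothetical stand-in for the score of
    label i spanning frames [s, t); it is assumed to return finite values.
    """
    if not 1 <= num_labels <= num_frames:
        raise ValueError("need 1 <= num_labels <= num_frames")

    neg_inf = float("-inf")
    # best[i][t]: best score for the first i labels covering frames [0, t)
    best = [[neg_inf] * (num_frames + 1) for _ in range(num_labels + 1)]
    # back[i][t]: start frame of the i-th segment on that best path
    back = [[-1] * (num_frames + 1) for _ in range(num_labels + 1)]
    best[0][0] = 0.0

    for i in range(1, num_labels + 1):
        for t in range(i, num_frames + 1):
            for s in range(i - 1, t):
                if best[i - 1][s] == neg_inf:
                    continue  # no valid segmentation ends at frame s
                cand = best[i - 1][s] + seg_score(i - 1, s, t)
                if cand > best[i][t]:
                    best[i][t] = cand
                    back[i][t] = s

    # Trace the boundaries back from the final frame.
    bounds: List[Tuple[int, int]] = []
    t = num_frames
    for i in range(num_labels, 0, -1):
        s = back[i][t]
        bounds.append((s, t))
        t = s
    bounds.reverse()
    return best[num_labels][num_frames], bounds
```

The search costs on the order of (number of labels) × (number of frames)² calls to `seg_score`. A practical system would typically bound the segment duration or restrict the boundaries to those found in a lattice; both are assumptions of this sketch rather than details given in the abstract.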
