SVMs for Automatic Speech Recognition: A Survey

Hidden Markov Models (HMMs) are undoubtedly the most widely used core technique for Automatic Speech Recognition (ASR). Nevertheless, we are still far from achieving high-performance ASR systems. Some alternative approaches, most of them based on Artificial Neural Networks (ANNs), were proposed during the late eighties and early nineties. Some of them tackled the ASR problem using predictive ANNs, while others proposed hybrid HMM/ANN systems. Despite some achievements, HMMs remain the predominant approach today. During the last decade, however, a new tool has appeared in the field of machine learning that has proved able to cope with hard classification problems in several fields of application: the Support Vector Machine (SVM). SVMs are effective discriminative classifiers with several outstanding characteristics: their solution is the one with maximum margin; they can deal with samples of very high dimensionality; and their convergence to the minimum of the associated cost function is guaranteed. These characteristics have made SVMs very popular and successful. In this chapter we discuss their strengths and weaknesses in the ASR context and review the current state-of-the-art techniques. We organize the contributions in two parts: isolated-word recognition and continuous speech recognition. In the first part we review several techniques for producing the fixed-dimension vectors that standard SVMs require. We then explore more sophisticated techniques based on kernels that can handle sequences of different lengths. Among them is the DTAK kernel, simple and effective, which revives an old speech recognition technique: Dynamic Time Warping (DTW). In the second part, we describe some recent approaches that use SVMs to tackle more complex tasks such as connected digit recognition or continuous speech recognition. Finally, we draw some conclusions and outline several ongoing lines of research.
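
As a rough illustration of the alignment idea behind DTW mentioned above, the following minimal sketch computes a classic DTW distance between two feature-vector sequences of different lengths. It is plain DTW with a Euclidean local distance, not the DTAK kernel itself, whose exact formulation is not reproduced here; the function name `dtw_distance` and the toy sequences are assumptions made for illustration only.

```python
import math


def dtw_distance(a, b):
    """Classic DTW distance between two sequences of feature vectors.

    a, b: lists of equal-dimension vectors (lists of floats); the two
    sequences may have different lengths.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated distance aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean local distance
            # Allowed moves: advance in a only, in b only, or in both
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


# Toy one-dimensional "utterances": the second repeats one frame.
short_seq = [[0.0], [1.0], [2.0]]
long_seq = [[0.0], [1.0], [1.0], [2.0]]
print(dtw_distance(short_seq, long_seq))
```

The warping path absorbs the repeated frame, so sequences that differ only in timing align at low cost. This is the property that makes alignment-based similarity attractive when the classifier otherwise needs fixed-dimension inputs.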
