Efficient one-vs-one kernel ridge regression for speech recognition

Recent evidence suggests that the performance of kernel methods may match that of deep neural networks (DNNs), which have been the state-of-the-art approach for speech recognition. In this work, we present an improvement of the kernel ridge regression studied by Huang et al. (ICASSP 2014) and show that our proposal is computationally advantageous. Our approach performs classification using the one-vs-one scheme, which, under certain assumptions, reduces the costs of the one-vs-rest scheme asymptotically by a factor of c^2 in training time and a factor of c in memory consumption. Here, c is the number of classes, which for speech recognition is typically on the order of hundreds to thousands. We demonstrate empirical results on the benchmark corpus TIMIT. In particular, the classification accuracy is one to two percentage points higher (in absolute terms) than the best of the kernel methods and DNNs reported by Huang et al., and the speech recognition accuracy is comparable.
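To make the training-time argument concrete: with n training samples spread roughly evenly over c classes, each pairwise subproblem sees about 2n/c samples, so a direct kernel solve costs O((2n/c)^3); summed over the c(c-1)/2 pairs this is O(n^3/c) in total, versus O(c n^3) for c one-vs-rest solves on all n samples, which is the asymptotic c^2 saving. The sketch below illustrates the one-vs-one scheme in NumPy with an exact RBF-kernel solve and majority voting; the kernel choice, regularizer, and all names are illustrative assumptions, not the paper's implementation, which targets much larger problems and scalable kernel machinery such as random features [10].

    import itertools
    import numpy as np

    def rbf_kernel(A, B, gamma=0.1):
        # Gaussian (RBF) kernel matrix between the rows of A and the rows of B.
        sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
        return np.exp(-gamma * sq)

    def train_ovo_krr(X, y, lam=1e-3, gamma=0.1):
        # Fit one kernel ridge regressor per class pair, using only that
        # pair's samples. Each pairwise kernel matrix is ~(c/2)^2 times
        # smaller than the full n-by-n matrix a one-vs-rest solve would
        # factor; over the ~c^2/2 pairs this gives the asymptotic c^2
        # training-time saving cited in the abstract.
        models = {}
        for i, j in itertools.combinations(np.unique(y), 2):
            mask = (y == i) | (y == j)
            Xp = X[mask]
            yp = np.where(y[mask] == i, 1.0, -1.0)  # regress on +/-1 targets
            K = rbf_kernel(Xp, Xp, gamma)
            alpha = np.linalg.solve(K + lam * np.eye(len(Xp)), yp)
            models[(i, j)] = (Xp, alpha)
        return models

    def predict_ovo_krr(models, X, gamma=0.1):
        # Classify by majority vote over all c(c-1)/2 pairwise regressors.
        classes = sorted({c for pair in models for c in pair})
        idx = {c: k for k, c in enumerate(classes)}
        votes = np.zeros((len(X), len(classes)))
        for (i, j), (Xp, alpha) in models.items():
            score = rbf_kernel(X, Xp, gamma) @ alpha
            winner = np.where(score > 0.0, idx[i], idx[j])
            votes[np.arange(len(X)), winner] += 1.0
        return np.asarray(classes)[votes.argmax(axis=1)]

In practice, the hard voting above could be replaced by pairwise coupling [8] to produce class posterior estimates, which are more useful for downstream HMM-based decoding.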

[1] Robert Tibshirani, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition. Springer Series in Statistics, 2001.

[2] Brian Kingsbury, et al. Arccosine kernels: Acoustic modeling with infinite neural networks. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.

[3] Haim Avron, et al. High-Performance Kernel Machines With Implicit Distributed Optimization and Randomization. Technometrics, 2014.

[4] Tara N. Sainath, et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine, 2012.

[5] Brian Kingsbury, et al. How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets. arXiv, 2014.

[6] Tara N. Sainath, et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition, 2012.

[7] Quanfu Fan, et al. Random Laplace Feature Maps for Semigroup Kernels on Histograms. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[8] Chih-Jen Lin, et al. Probability Estimates for Multi-class Classification by Pairwise Coupling. Journal of Machine Learning Research, 2003.

[9] A. Atiya, et al. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. IEEE Transactions on Neural Networks, 2005.

[10] Benjamin Recht, et al. Random Features for Large-Scale Kernel Machines. NIPS, 2007.

[11] Tara N. Sainath, et al. Kernel methods match Deep Neural Networks on TIMIT. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.

[12] Geoffrey E. Hinton, et al. Acoustic Modeling Using Deep Belief Networks. IEEE Transactions on Audio, Speech, and Language Processing, 2012.

[13] Petros Drineas, et al. On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning. Journal of Machine Learning Research, 2005.

[14] Le Song, et al. Scalable Kernel Methods via Doubly Stochastic Gradients. NIPS, 2014.

[15] Yousef Saad. Iterative Methods for Sparse Linear Systems. SIAM, 2003.

[16] Bernhard Schölkopf, et al. A Generalized Representer Theorem. COLT/EuroCOLT, 2001.

[17] Robert H. Halstead, et al. Matrix Computations. Encyclopedia of Parallel Computing, 2011.