Random features for Kernel Deep Convex Network

The kernel version of the deep convex network (K-DCN), a recently developed deep learning architecture, is improved here to address the scalability problem that arises when the training and testing sets become very large. Our solution is based on random Fourier features, which have the strong theoretical property of approximating the Gaussian kernel while enabling efficient computation during both training and evaluation of the K-DCN on large training sets. We demonstrate empirically that, just as with the conventional K-DCN using exact Gaussian kernels, random Fourier features allow kernel modules to be stacked successfully into a deep architecture. Evaluation experiments on phone recognition and spoken language understanding tasks both confirm the computational efficiency of the K-DCN with random features. With sufficient depth, its phone recognition and slot-filling accuracies are comparable to or slightly higher than those of the K-DCN with exact Gaussian kernels, at a significantly lower computational cost.
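The core technique named in the abstract, approximating a Gaussian kernel with random Fourier features, can be illustrated with a minimal sketch following the standard Rahimi-Recht construction. The function name, parameters, and the self-check below are our own illustration under that assumption, not code from the paper:

```python
# Minimal sketch of random Fourier features approximating a Gaussian
# (RBF) kernel. Frequencies W are drawn from the Fourier transform of
# the kernel; phases b are uniform on [0, 2*pi].
import numpy as np

def random_fourier_features(X, D=1000, sigma=1.0, rng=None):
    """Map rows of X (n x d) to D random features whose inner products
    approximate exp(-||x - y||^2 / (2 * sigma^2))."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    W = rng.normal(scale=1.0 / sigma, size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Quick check: feature inner products vs. the exact Gaussian kernel.
X = np.random.default_rng(0).normal(size=(5, 10))
Z = random_fourier_features(X, D=5000, sigma=1.0, rng=0)
approx = Z @ Z.T
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
exact = np.exp(-sq_dists / 2.0)
print(np.abs(approx - exact).max())  # shrinks as D grows
```

Because the kernel is replaced by an explicit D-dimensional feature map, each module can be trained with ordinary linear (ridge) regression in the feature space, avoiding the n x n Gram matrix that makes exact kernel methods costly for large n.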
