Emotion Recognition from Varying Length Patterns of Speech using CNN-based Segment-Level Pyramid Match Kernel based SVMs

Convolutional neural networks (CNNs) and their variants have achieved impressive performance on speech processing tasks such as spoken language identification, speaker verification, and speech emotion recognition. Conventionally, CNNs for speech applications take features from fixed-duration speech segments as input. In this work, we instead take features from the complete speech signal as input to the CNN. We propose using a spatial pyramid pooling (SPP) layer in the CNN architecture to remove the fixed-length constraint, so that features from varying-length speech signals can be given to the CNN for end-to-end training. The proposed architecture also produces a varying-size set of feature maps from the convolution layer. Further, we propose a novel CNN-based segment-level pyramid match kernel (CNN-SLPMK) as a dynamic kernel between a pair of varying-size sets of feature maps, for use in a classification framework based on support vector machines (SVMs). We demonstrate that the proposed approach achieves results comparable to state-of-the-art techniques on the speech emotion recognition task.
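To illustrate how pyramid pooling can remove a fixed-length constraint, the following is a minimal NumPy sketch of pooling along the time axis only. It is an illustration under assumptions, not the authors' implementation: the function name `temporal_pyramid_pool`, the pyramid levels `(1, 2, 4)`, the choice of max-pooling, and the `(channels, frames)` layout are all hypothetical choices made here.

```python
import numpy as np


def temporal_pyramid_pool(feature_maps, levels=(1, 2, 4)):
    """Max-pool a (channels, frames) array into a fixed-length vector.

    Hypothetical helper: for each pyramid level with n bins, the time axis
    is split into n roughly equal segments and every channel is max-pooled
    within each segment. Requires at least one frame.
    """
    channels, frames = feature_maps.shape
    pooled = []
    for n_bins in levels:
        # Fractional bin edges over the time axis; floor/ceil below make
        # neighbouring bins overlap slightly rather than leave gaps.
        edges = np.linspace(0, frames, n_bins + 1)
        for b in range(n_bins):
            start = int(np.floor(edges[b]))
            # Guarantee a non-empty bin even when frames < n_bins.
            end = max(int(np.ceil(edges[b + 1])), start + 1)
            pooled.append(feature_maps[:, start:end].max(axis=1))
    # Output length is channels * sum(levels), independent of frames.
    return np.concatenate(pooled)


# Two utterances of different duration map to vectors of the same size.
rng = np.random.default_rng(0)
short_utt = rng.standard_normal((16, 37))    # 16 channels, 37 frames
long_utt = rng.standard_normal((16, 212))    # 16 channels, 212 frames
vec_short = temporal_pyramid_pool(short_utt)
vec_long = temporal_pyramid_pool(long_utt)
# Both vectors have shape (112,), i.e. 16 channels * (1 + 2 + 4) bins.
```

Because the output length depends only on the number of channels and the pyramid levels, a fully connected layer can follow the pooling step regardless of utterance duration.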
