Audio-based emotion recognition using GMM supervectors and an SVM with linear kernel

In this paper, we present an audio-based emotion recognition model that combines openSMILE feature extraction, Gaussian mixture model (GMM) supervectors, and a support vector machine (SVM) with a linear kernel. For each emotional video, openSMILE extracts 39-dimensional Mel-frequency cepstral coefficient (MFCC) features from the audio track. These variable-length feature sequences are then mapped to fixed-size representations using GMM supervectors with 32 mixture components. Finally, the supervectors are classified by an SVM with a linear kernel. We evaluate the model on the AFEW 2017 and SAVEE datasets and compare the results against state-of-the-art networks. The model achieves 37% accuracy on AFEW and 73.5% on SAVEE, improving on several other audio-based emotion recognition models.
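The pipeline above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses synthetic frame-level features in place of openSMILE MFCCs, a toy configuration (13-dimensional features, 8 mixtures, rather than the paper's 39 dimensions and 32 mixtures), and a standard mean-only MAP adaptation with an assumed relevance factor of 16 to build the supervectors.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def gmm_supervector(ubm, frames, relevance=16.0):
    """MAP-adapt the UBM means to one utterance and stack them into a supervector."""
    post = ubm.predict_proba(frames)            # (T, K) component responsibilities
    n_k = post.sum(axis=0)                      # zeroth-order (soft count) statistics
    f_k = post.T @ frames                       # first-order statistics, shape (K, D)
    alpha = (n_k / (n_k + relevance))[:, None]  # data-dependent adaptation weights
    means = alpha * (f_k / np.maximum(n_k[:, None], 1e-8)) + (1 - alpha) * ubm.means_
    return means.ravel()                        # fixed-length K*D vector per utterance

rng = np.random.default_rng(0)
D, K = 13, 8  # toy sizes; the paper uses 39-dim MFCCs and 32 mixtures
# Synthetic "utterances" of 100 frames each: class 1 features are shifted by +1.
utts = [rng.normal(loc=lbl, size=(100, D)) for lbl in (0, 1) for _ in range(10)]
labels = [0] * 10 + [1] * 10

# Universal background model trained on all frames pooled together.
ubm = GaussianMixture(n_components=K, covariance_type="diag", random_state=0)
ubm.fit(np.vstack(utts))

# Variable-length frame sequences become fixed-size supervectors.
X = np.array([gmm_supervector(ubm, u) for u in utts])

# Linear-kernel SVM on the supervectors.
clf = SVC(kernel="linear").fit(X, labels)
print(clf.score(X, labels))
```

The key point the sketch shows is the size normalization step: every utterance, regardless of its number of frames, is reduced to the same K×D supervector, which is what makes a standard linear SVM applicable.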
