Ensemble System for Multimodal Emotion Recognition Challenge (MEC 2017)

Speech emotion recognition (SER) remains a challenging task with many open problems, such as extracting representative features and handling imbalanced training data. Over the past few decades much research has been devoted to this area, but performance is still far from satisfactory. In this paper, we propose an Ensemble System that fuses four subsystems. The TDNN (Time Delay Neural Network) System uses a neural network with p-norm activations and time delay as the classifier. The i-vector/SVM (Support Vector Machine) System learns acoustic features in the i-vector space. The Simple Late Fusion System fuses different features at the decision level, while the Balanced Late Fusion System adds a data rebalance module that rebalances the class distribution of the training samples. The overall Ensemble System combines the advantages of each subsystem at the decision level. Experiments are conducted on the CHEAVD 2.0 database provided in the Multimodal Emotion Recognition Challenge. On the test set, the Simple Late Fusion System outperforms the baseline by 3.9% and 6.9% in Accuracy (ACC) and Macro Average Precision (MAP), respectively. Our results indicate that the Simple Late Fusion System is more effective in ACC and MAP, while the Balanced Late Fusion System outperforms the other systems in Macro Average Recall and Macro Average F1.
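The two core ideas in the abstract, decision-level late fusion and class rebalancing, can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact implementation: it assumes each subsystem outputs per-class posteriors, fuses them by (weighted) averaging, and rebalances classes by random oversampling; the paper does not specify these exact rules, so the function names, the weighting scheme, and the oversampling strategy are all assumptions for illustration.

```python
import numpy as np


def late_fusion(posteriors, weights=None):
    """Decision-level fusion: weighted average of per-system class posteriors.

    posteriors: list of (n_samples, n_classes) arrays, one per subsystem.
    weights:    optional per-system weights (hypothetical; uniform by default).
    Returns the fused class predictions (argmax over the averaged posteriors).
    """
    stacked = np.stack(posteriors)  # (n_systems, n_samples, n_classes)
    if weights is None:
        weights = np.ones(len(posteriors)) / len(posteriors)
    fused = np.tensordot(weights, stacked, axes=1)  # (n_samples, n_classes)
    return fused.argmax(axis=1)


def rebalance(features, labels, rng=None):
    """Random oversampling so every class matches the largest class count.

    One simple way to realize a 'data rebalance module': minority-class
    samples are duplicated at random until all classes are the same size.
    """
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    target = max(np.sum(labels == c) for c in classes)
    idx = []
    for c in classes:
        c_idx = np.where(labels == c)[0]
        idx.extend(c_idx)
        extra = target - len(c_idx)
        if extra > 0:  # duplicate minority-class samples with replacement
            idx.extend(rng.choice(c_idx, size=extra, replace=True))
    idx = np.asarray(idx)
    return features[idx], labels[idx]
```

In this sketch, the overall Ensemble System would simply call `late_fusion` on the posteriors of all four subsystems, while the Balanced Late Fusion System would train its subsystems on the output of `rebalance`.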
