Speech Emotion Recognition with Heterogeneous Feature Unification of Deep Neural Network

Automatic speech emotion recognition is challenging because of the gap between acoustic features and human emotions, and recognition performance depends strongly on how discriminative the extracted acoustic features are for a given task. In this work, we propose a novel deep neural architecture that extracts informative feature representations from heterogeneous acoustic feature groups, which may otherwise contain redundant and unrelated information that degrades recognition performance. After the informative features are obtained, a fusion network is trained to jointly learn a discriminative acoustic feature representation, and a Support Vector Machine (SVM) serves as the final classifier. Experimental results on the IEMOCAP dataset demonstrate that the proposed architecture improves recognition performance over existing state-of-the-art approaches, achieving an accuracy of 64%.
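
The abstract does not give implementation details, but a minimal sketch of the three-stage pipeline it describes (per-group encoders, a jointly trained fusion network, and an SVM on the fused representation) might look like the following. The feature-group names and dimensions, layer sizes, number of emotion classes, and the cross-entropy training head are all illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the described pipeline, not the authors' exact model.
# Assumed: feature-group dimensions, encoder sizes, four IEMOCAP classes,
# and a cross-entropy head used only to train the fusion network.
import torch
import torch.nn as nn
from sklearn.svm import SVC

FEATURE_GROUPS = {"mfcc": 39, "prosodic": 35, "spectral": 60}  # assumed dims
NUM_CLASSES = 4  # e.g., angry / happy / sad / neutral

class GroupEncoder(nn.Module):
    """Encodes one heterogeneous feature group into a compact representation."""
    def __init__(self, in_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, hidden_dim), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

class FusionNetwork(nn.Module):
    """Jointly learns a discriminative representation from all group encodings."""
    def __init__(self, groups, hidden_dim=64, fused_dim=128):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {name: GroupEncoder(dim, hidden_dim) for name, dim in groups.items()}
        )
        self.fusion = nn.Sequential(
            nn.Linear(hidden_dim * len(groups), fused_dim), nn.ReLU(),
        )
        self.head = nn.Linear(fused_dim, NUM_CLASSES)  # training-time classifier

    def forward(self, inputs):
        encoded = [self.encoders[name](inputs[name]) for name in self.encoders]
        fused = self.fusion(torch.cat(encoded, dim=-1))
        return fused, self.head(fused)

model = FusionNetwork(FEATURE_GROUPS)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Toy training loop on random tensors standing in for extracted features.
batch = {name: torch.randn(32, dim) for name, dim in FEATURE_GROUPS.items()}
labels = torch.randint(0, NUM_CLASSES, (32,))
for _ in range(5):
    fused, logits = model(batch)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, discard the softmax head and fit the SVM on fused features.
with torch.no_grad():
    fused, _ = model(batch)
svm = SVC(kernel="rbf")
svm.fit(fused.numpy(), labels.numpy())
print(svm.predict(fused.numpy()[:5]))
```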
