Audio-visual emotion fusion (AVEF): A deep efficient weighted approach

Abstract Multi-modal emotion recognition lacks an explicit mapping between emotional states and audio and image features, so extracting effective emotion information from audio-visual data remains challenging. In addition, noise and data redundancy are not modeled well, so emotion recognition models often suffer from low efficiency. Deep neural networks (DNNs) perform excellently at feature extraction and highly non-linear feature fusion, and cross-modal noise modeling has great potential for addressing data pollution and data redundancy. Inspired by these observations, this paper proposes a deep weighted fusion method for audio-visual emotion recognition. First, we perform cross-modal noise modeling on the audio and video data, which eliminates most of the data pollution in the audio channel and the data redundancy in the visual channel. The noise modeling is implemented with voice activity detection (VAD), and the redundancy in the visual data is removed by aligning the speech regions in the audio and visual data. Then, we extract audio emotion features and visual expression features with two feature extractors. The audio emotion feature extractor, audio-net, is a 2D CNN that accepts image-based Mel-spectrograms as input. The facial expression feature extractor, visual-net, is a 3D CNN that is fed facial expression image sequences. To train the two convolutional neural networks efficiently on a small data set, we adopt transfer learning. Next, we employ a deep belief network (DBN) for highly non-linear fusion of the multi-modal emotion features. We train the feature extractors and the fusion network synchronously. Finally, emotion classification is performed by a support vector machine using the output of the fusion network. By accounting for cross-modal feature fusion, denoising, and redundancy removal, our fusion method shows excellent performance on the selected data set.
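
The cross-modal step described in the abstract (detect the speech region in the audio, then keep only the video frames that overlap it) can be illustrated with a minimal sketch. The abstract does not state which VAD algorithm is used, so the code below assumes a simple short-time-energy threshold as a stand-in; the function names `speech_span` and `video_frames_in_span`, the frame sizes, the threshold, and the synthetic test signal are all hypothetical choices made for illustration only.

```python
import numpy as np


def speech_span(signal, sr, frame_ms=25, hop_ms=10, rel_threshold=0.1):
    """Energy-based VAD stand-in (an assumption, not the paper's VAD).

    Returns (start_sec, end_sec) covering the first to the last frame whose
    mean energy exceeds rel_threshold * max frame energy, or None if the
    signal is entirely silent.
    """
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame) // hop)
    energy = np.array([
        np.mean(signal[i * hop:i * hop + frame] ** 2) for i in range(n_frames)
    ])
    if energy.max() == 0:
        return None
    voiced = np.flatnonzero(energy > rel_threshold * energy.max())
    if voiced.size == 0:
        return None
    start = voiced[0] * hop / sr
    end = (voiced[-1] * hop + frame) / sr
    return start, end


def video_frames_in_span(span, fps, n_video_frames):
    """Indices of video frames that overlap the detected speech span."""
    start, end = span
    first = int(np.floor(start * fps))
    last = min(n_video_frames, int(np.ceil(end * fps)))
    return np.arange(first, last)


# Synthetic example: 1 s silence, 1 s of a 220 Hz tone, 1 s silence.
sr = 16000
t = np.arange(3 * sr) / sr
signal = np.where((t >= 1.0) & (t < 2.0), 0.5 * np.sin(2 * np.pi * 220 * t), 0.0)

span = speech_span(signal, sr)
frames = video_frames_in_span(span, fps=25, n_video_frames=75)
print(span, frames[0], frames[-1])
```

On the synthetic signal, the detected span lies close to the 1 s to 2 s tone, and only the roughly 25 video frames overlapping it are kept; the remaining frames would be discarded as redundant before feature extraction. A production system would likely replace the energy threshold with a more robust VAD front end.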
