Visual-audio emotion recognition based on multi-task and ensemble learning with multiple features

Abstract An ensemble visual-audio emotion recognition framework based on multi-task and blending learning with multiple features is proposed in this paper. To address the problem that no single feature set can accurately identify all emotions, we extract two kinds of features from each modality, i.e., Interspeech 2010 and deep features for audio data, and LBP and deep features for visual data, with the intent of accurately identifying different emotions through complementary features. Owing to the diversity of these features, SVM classifiers are designed for the manual features, i.e., Interspeech 2010 features and local LBP features, and CNNs for the deep features, through which four sub-models are obtained. Finally, the blending ensemble algorithm is used to fuse the sub-models and improve the performance of visual-audio emotion recognition. In addition, multi-task learning is applied in the CNN model for deep features, which can predict multiple tasks at the same time with fewer parameters and improves the sensitivity of the single recognition model to the user's emotion by sharing information between tasks. Experiments are performed on the eNTERFACE database, and the results indicate that the recognition accuracy of the multi-task CNN increases by 3% and 2% on average over the single-task CNN model in speaker-independent and speaker-dependent experiments, respectively. The visual-audio emotion recognition accuracy of our method reaches 81.36% and 78.42% in speaker-independent and speaker-dependent experiments, respectively, outperforming several state-of-the-art works.
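The blending fusion step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the four sub-models (audio SVM, audio CNN, visual SVM, visual CNN) are replaced by synthetic stand-ins that emit noisy class-probability vectors on a held-out blend split, and the meta-learner is a simple softmax regression trained by gradient descent on the concatenated probabilities. All sizes and noise levels are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 6          # eNTERFACE covers six basic emotions
n_blend = 300          # size of the held-out blend split (assumed)

def one_hot(y, k):
    out = np.zeros((len(y), k))
    out[np.arange(len(y)), y] = 1.0
    return out

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# --- stand-ins for the four sub-models' outputs on the blend split ---
y_blend = rng.integers(0, n_classes, n_blend)

def fake_probs(y, noise):
    # probabilities peaked at the true class, corrupted by uniform noise
    p = one_hot(y, n_classes) + noise * rng.random((len(y), n_classes))
    return p / p.sum(axis=1, keepdims=True)

# one stand-in per sub-model, with different noise levels (assumed)
probs = [fake_probs(y_blend, noise) for noise in (1.5, 1.2, 1.8, 1.4)]

# meta-features for blending: concatenated sub-model probabilities
X = np.hstack(probs)                 # shape (n_blend, 4 * n_classes)
Y = one_hot(y_blend, n_classes)

# --- meta-learner: softmax regression fit by gradient descent ---
W = np.zeros((X.shape[1], n_classes))
b = np.zeros(n_classes)
for _ in range(500):
    P = softmax(X @ W + b)
    W -= 1.0 * (X.T @ (P - Y)) / n_blend
    b -= 1.0 * (P - Y).mean(axis=0)

blend_pred = softmax(X @ W + b).argmax(axis=1)
acc = float((blend_pred == y_blend).mean())
single_accs = [float((p.argmax(axis=1) == y_blend).mean()) for p in probs]
```

In a real pipeline the sub-models would be trained on the training split, their probability outputs collected on the blend split to fit the meta-learner, and the fused model evaluated on a separate test split.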

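The multi-task claim in the abstract — multiple predictions with fewer parameters through sharing — can be illustrated with hard parameter sharing: one trunk computed once, with a separate head per task. The layer sizes and the auxiliary task below are assumptions for the sketch, not the paper's architecture; only the forward pass and the parameter count are shown.

```python
import numpy as np

# Hypothetical sizes: illustrative only, not the paper's network.
d_in, d_hidden = 512, 128      # deep-feature dim, shared hidden width
k_emotion, k_aux = 6, 2        # main task: 6 emotions; assumed auxiliary task

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# shared trunk + two task-specific heads (hard parameter sharing)
W_shared = rng.standard_normal((d_in, d_hidden)) * 0.01
b_shared = np.zeros(d_hidden)
W_emotion = rng.standard_normal((d_hidden, k_emotion)) * 0.01
b_emotion = np.zeros(k_emotion)
W_aux = rng.standard_normal((d_hidden, k_aux)) * 0.01
b_aux = np.zeros(k_aux)

def forward(x):
    h = relu(x @ W_shared + b_shared)   # computed once, reused by both heads
    return (softmax(h @ W_emotion + b_emotion),
            softmax(h @ W_aux + b_aux))

x = rng.standard_normal((4, d_in))
p_emotion, p_aux = forward(x)

# parameter counts: one shared trunk vs two independent single-task networks
head_params = W_emotion.size + b_emotion.size + W_aux.size + b_aux.size
n_multi = W_shared.size + b_shared.size + head_params
n_separate = 2 * (W_shared.size + b_shared.size) + head_params
```

Because the trunk is stored and evaluated once rather than per task, `n_multi < n_separate`; during training, gradients from both heads flow into the shared trunk, which is the information-sharing effect the abstract refers to.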