Meta-transfer learning for emotion recognition

Deep learning has been widely adopted in automatic emotion recognition and has led to significant progress in the field. However, because annotated emotion datasets are scarce, pre-trained models are limited in their generalization capability and perform poorly on novel test sets. To mitigate this challenge, transfer learning that fine-tunes pre-trained models has been applied; however, fine-tuning may overwrite or discard important knowledge captured by the pre-trained model. In this paper, we address this issue by proposing a PathNet-based transfer learning method that transfers emotional knowledge learned from one visual/audio emotion domain to another, and transfers emotional knowledge learned from multiple audio emotion domains into one another, to improve overall emotion recognition accuracy. To demonstrate the robustness of the proposed system, we carried out extensive experiments on facial expression recognition and speech emotion recognition tasks using three emotion datasets: SAVEE, EMODB, and eNTERFACE. The experimental results indicate that the proposed system improves emotion recognition performance and is substantially superior to recently proposed transfer learning methods based on fine-tuning pre-trained models.
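The PathNet mechanism the paper builds on can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the module layout, the stand-in fitness function, and all names are illustrative assumptions. A "path" selects a few modules per layer of a super network, a microbial genetic algorithm evolves paths for a source task, and the winning path's modules are frozen and reused when evolving paths for a target task.

```python
import random

random.seed(0)

# Toy PathNet-style transfer (illustrative simplification): the super
# network has L layers of M candidate modules; a path activates K per layer.
L, M, K = 3, 4, 2

def random_path():
    """Genotype: K distinct module indices chosen per layer."""
    return [random.sample(range(M), K) for _ in range(L)]

def fitness(path, good):
    """Stand-in for validation accuracy: count activated modules useful
    for the current task (real PathNet trains the path's weights)."""
    return sum(len(set(path[l]) & good[l]) for l in range(L))

def evolve(good, pop_size=10, generations=400):
    """Microbial GA: the tournament loser is overwritten by a mutated
    copy of the winner."""
    pop = [random_path() for _ in range(pop_size)]
    for _ in range(generations):
        i, j = random.sample(range(pop_size), 2)
        win, lose = (i, j) if fitness(pop[i], good) >= fitness(pop[j], good) else (j, i)
        child = [layer[:] for layer in pop[win]]
        layer = random.randrange(L)
        child[layer][random.randrange(K)] = random.randrange(M)  # point mutation
        pop[lose] = child
    return max(pop, key=lambda p: fitness(p, good))

# Source emotion domain: evolve a path, then freeze its modules
# (in real PathNet, their weights are fixed against overwriting).
good_src = [{0, 1}, {1, 2}, {0, 3}]
path_src = evolve(good_src)
frozen = [set(layer) for layer in path_src]

# Target emotion domain: frozen modules carry transferable knowledge, so
# reusing them counts toward target fitness alongside new target modules.
good_tgt = [frozen[l] | {3} for l in range(L)]
path_tgt = evolve(good_tgt)
```

Because the frozen modules are never mutated away by gradient updates in real PathNet, the target-domain path can exploit source-domain knowledge without the catastrophic forgetting that plain fine-tuning risks.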
