Multimodal Emotion Recognition on RAVDESS Dataset Using Transfer Learning

Emotion recognition is attracting increasing attention from the research community because of the many areas where it can be applied, such as healthcare and road-safety systems. In this paper, we propose a multimodal emotion recognition system that relies on speech and facial information. For the speech-based modality, we evaluated several transfer-learning techniques, specifically embedding extraction and fine-tuning. The best accuracy was achieved by fine-tuning the CNN14 model of the PANNs framework, confirming that training is more robust when it does not start from scratch and the source and target tasks are similar. For facial emotion recognition, we propose a framework consisting of a Spatial Transformer Network pre-trained on saliency maps and facial images, followed by a bi-LSTM with an attention mechanism. Error analysis showed that frame-based systems can be problematic when applied directly to a video-based task, even after domain adaptation; this opens a new line of research into correcting this mismatch while still exploiting the knowledge embedded in these pre-trained models. Finally, by combining the two modalities with a late-fusion strategy, we achieved 80.08% accuracy on the RAVDESS dataset under subject-wise 5-fold cross-validation, classifying eight emotions. The results reveal that both modalities carry relevant information about the user's emotional state and that their combination improves system performance.
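To make the described pipeline concrete, the listing below is a minimal PyTorch sketch (not the authors' released code) of two of the components named above: an attention bi-LSTM head that pools a sequence of per-frame face embeddings into a clip-level prediction, and a simple late-fusion rule over the per-modality posteriors. All dimensions and the fusion weight (EMB_DIM, HID, w_speech) are illustrative assumptions rather than values from the paper, and the random speech logits stand in for the output of the fine-tuned CNN14 speech branch.

    # Minimal sketch, assuming per-frame face embeddings of size EMB_DIM from the
    # STN-based frontend. Speech branch (not shown): a pretrained audio CNN whose
    # 527-way AudioSet head is swapped for an 8-way emotion head before fine-tuning,
    # e.g. model.fc_audioset = nn.Linear(2048, NUM_EMOTIONS) in the public PANNs Cnn14.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    NUM_EMOTIONS = 8   # RAVDESS: neutral, calm, happy, sad, angry, fearful, disgust, surprised
    EMB_DIM = 512      # assumed size of the per-frame face embedding
    HID = 128          # assumed LSTM hidden size

    class AttentionBiLSTM(nn.Module):
        """Clip-level classifier over a sequence of frame embeddings."""
        def __init__(self, emb_dim=EMB_DIM, hid=HID, n_classes=NUM_EMOTIONS):
            super().__init__()
            self.lstm = nn.LSTM(emb_dim, hid, batch_first=True, bidirectional=True)
            self.att = nn.Linear(2 * hid, 1)   # additive attention scorer
            self.fc = nn.Linear(2 * hid, n_classes)

        def forward(self, x):                  # x: (batch, frames, emb_dim)
            h, _ = self.lstm(x)                # (batch, frames, 2*hid)
            scores = self.att(torch.tanh(h))   # (batch, frames, 1)
            alpha = F.softmax(scores, dim=1)   # attention weights over frames
            context = (alpha * h).sum(dim=1)   # weighted sum -> (batch, 2*hid)
            return self.fc(context)            # unnormalised class logits

    def late_fusion(speech_logits, face_logits, w_speech=0.5):
        """Weighted average of per-modality posteriors (one simple late-fusion rule)."""
        p_speech = F.softmax(speech_logits, dim=-1)
        p_face = F.softmax(face_logits, dim=-1)
        return w_speech * p_speech + (1.0 - w_speech) * p_face

    # Example: 30 frames of face embeddings for a batch of 4 clips.
    video_head = AttentionBiLSTM()
    face_logits = video_head(torch.randn(4, 30, EMB_DIM))
    speech_logits = torch.randn(4, NUM_EMOTIONS)   # stand-in for the fine-tuned CNN14 output
    pred = late_fusion(speech_logits, face_logits).argmax(dim=-1)

In a setup like this, the relative weight given to each modality in the fusion is a natural hyper-parameter to tune on the validation folds; more elaborate late-fusion schemes (e.g. a small classifier over the concatenated posteriors) are also common.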
