Improving the Accuracy of Automatic Facial Expression Recognition in Speaking Subjects with Deep Learning

When automatic facial expression recognition is applied to video sequences of speaking subjects, recognition accuracy has been found to be lower than with video sequences of still subjects. This effect, known as the speaking effect, arises during spontaneous conversations, where the speech articulation process influences facial configurations alongside the affective expressions. In this work we investigate whether cues related to the articulation process, provided as input to a deep neural network model in addition to facial features, increase emotion recognition accuracy. We develop two neural networks that classify facial expressions of speaking subjects from the RAVDESS dataset: a spatio-temporal CNN and a GRU-based RNN. They are first trained on facial features only, and then on both facial features and articulation-related cues extracted from a model trained for lip reading, while also varying the number of consecutive frames provided as input. We show that with these DNNs the addition of articulation-related features increases classification accuracy by up to 12%, with the gain growing as more consecutive frames are provided as input to the model.
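
The abstract does not specify the network architectures or feature dimensions. As an illustration only, the following is a minimal PyTorch sketch, assuming a GRU-based classifier that consumes per-frame facial features concatenated with articulation features (e.g. embeddings from a pre-trained lip-reading model); the feature sizes, hidden size, and the eight RAVDESS emotion classes are illustrative assumptions, not the authors' implementation.

    # Minimal sketch (assumed, not the authors' code): a GRU-based emotion
    # classifier over sequences of facial features concatenated per frame with
    # articulation features from a lip-reading model. All dimensions are
    # illustrative assumptions.
    import torch
    import torch.nn as nn

    class EmotionGRU(nn.Module):
        def __init__(self, facial_dim=136, articulation_dim=256,
                     hidden_dim=128, num_classes=8):
            super().__init__()
            self.gru = nn.GRU(facial_dim + articulation_dim, hidden_dim,
                              batch_first=True)
            self.classifier = nn.Linear(hidden_dim, num_classes)

        def forward(self, facial_feats, articulation_feats):
            # Both inputs: (batch, num_frames, feature_dim); fuse per frame.
            x = torch.cat([facial_feats, articulation_feats], dim=-1)
            _, h_n = self.gru(x)               # h_n: (1, batch, hidden_dim)
            return self.classifier(h_n[-1])    # logits: (batch, num_classes)

    # Usage: 16 consecutive frames per clip (the abstract varies this number).
    model = EmotionGRU()
    logits = model(torch.randn(4, 16, 136), torch.randn(4, 16, 256))

Training the same classifier head with the articulation input zeroed out or removed would correspond to the facial-features-only baseline described in the abstract.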
