Multi-Modal Recurrent Attention Networks for Facial Expression Recognition

Recent deep neural networks based methods have achieved state-of-the-art performance on various facial expression recognition tasks. Despite such progress, previous researches for facial expression recognition have mainly focused on analyzing color recording videos only. However, the complex emotions expressed by people with different skin colors under different lighting conditions through dynamic facial expressions can be fully understandable by integrating information from multi-modal videos. We present a novel method to estimate dimensional emotion states, where color, depth, and thermal recording videos are used as a multi-modal input. Our networks, called multi-modal recurrent attention networks (MRAN), learn spatiotemporal attention volumes to robustly recognize the facial expression based on attention-boosted feature volumes. We leverage the depth and thermal sequences as guidance priors for color sequence to selectively focus on emotional discriminative regions. We also introduce a novel benchmark for multi-modal facial expression recognition, termed as multi-modal arousal-valence facial expression recognition (MAVFER), which consists of color, depth, and thermal recording videos with corresponding continuous arousal-valence scores. The experimental results show that our method can achieve the state-of-the-art results in dimensional facial expression recognition on color recording datasets including RECOLA, SEWA and AFEW, and a multi-modal recording dataset including MAVFER.

[1]  Jun Wang,et al.  A 3D facial expression database for facial behavior research , 2006, 7th International Conference on Automatic Face and Gesture Recognition (FGR06).

[2]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[3]  Fabien Ringeval,et al.  AVEC 2015: The 5th International Audio/Visual Emotion Challenge and Workshop , 2015, ACM Multimedia.

[4]  Arman Savran,et al.  Combining video, audio and lexical indicators of affect in spontaneous conversation via particle filtering , 2012, ICMI '12.

[5]  Björn W. Schuller,et al.  Categorical and dimensional affect analysis in continuous input: Current trends and future directions , 2013, Image Vis. Comput..

[6]  Jean-Philippe Thiran,et al.  Prediction of asynchronous dimensional emotion ratings from audiovisual and physiological data , 2015, Pattern Recognit. Lett..

[7]  Xiaogang Wang,et al.  Deep Convolutional Network Cascade for Facial Point Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[9]  Razvan Pascanu,et al.  Combining modality specific deep neural networks for emotion recognition in video , 2013, ICMI '13.

[10]  Peng Liu,et al.  Spontaneous facial expression analysis based on temperature changes and head motions , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[11]  Shan Li,et al.  Reliable Crowdsourcing and Deep Locality-Preserving Learning for Unconstrained Facial Expression Recognition , 2019, IEEE Transactions on Image Processing.

[12]  Maja Pantic,et al.  This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON AFFECTIVE COMPUTING , 2022 .

[13]  Dit-Yan Yeung,et al.  Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting , 2015, NIPS.

[14]  Kurt Keutzer,et al.  PDANet: Polarity-consistent Deep Attention Network for Fine-grained Visual Emotion Regression , 2019, ACM Multimedia.

[15]  In-So Kweon,et al.  CBAM: Convolutional Block Attention Module , 2018, ECCV.

[16]  Cynthia LeRouge,et al.  Developing multimodal intelligent affective interfaces for tele-home health care , 2003, Int. J. Hum. Comput. Stud..

[17]  Kan Chen,et al.  AMC: Attention Guided Multi-modal Correlation Learning for Image Search , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Arthur C. Graesser,et al.  Toward an Affect-Sensitive AutoTutor , 2007, IEEE Intelligent Systems.

[19]  Victor O. K. Li,et al.  Video-based Emotion Recognition Using Deeply-Supervised Neural Networks , 2018, ICMI.

[20]  Ruslan Salakhutdinov,et al.  Action Recognition using Visual Attention , 2015, NIPS 2015.

[21]  Shaun J. Canavan,et al.  Multimodal Spontaneous Emotion Corpus for Human Behavior Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Dongmei Jiang,et al.  Multimodal Affective Dimension Prediction Using Deep Bidirectional Long Short-Term Memory Recurrent Neural Networks , 2015, AVEC@ACM Multimedia.

[23]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Tamás D. Gedeon,et al.  Emotion Recognition In The Wild Challenge 2014: Baseline, Data and Protocol , 2014, ICMI.

[25]  Fabien Ringeval,et al.  AVEC 2016: Depression, Mood, and Emotion Recognition Workshop and Challenge , 2016, AVEC@ACM Multimedia.

[26]  Kwanghoon Sohn,et al.  Automatic 2D-to-3D conversion using multi-scale deep neural network , 2017, 2017 IEEE International Conference on Image Processing (ICIP).

[27]  Yong Du,et al.  Facial Expression Recognition Based on Deep Evolutional Spatial-Temporal Networks , 2017, IEEE Transactions on Image Processing.

[28]  Àgata Lapedriza,et al.  Emotion Recognition in Context , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Takeo Kanade,et al.  The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops.

[30]  Kwanghoon Sohn,et al.  Deeply Aggregated Alternating Minimization for Image Restoration , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Raymond W. M. Ng,et al.  Multi-Modal Sequence Fusion via Recursive Attention for Emotion Recognition , 2018, CoNLL.

[32]  M. Bradley,et al.  Measuring emotion: the Self-Assessment Manikin and the Semantic Differential. , 1994, Journal of behavior therapy and experimental psychiatry.

[33]  Xu Jia,et al.  Guiding the Long-Short Term Memory Model for Image Caption Generation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[34]  Björn W. Schuller,et al.  AVEC 2012: the continuous audio/visual emotion challenge , 2012, ICMI '12.

[35]  Andrew Zisserman,et al.  Deep Face Recognition , 2015, BMVC.

[36]  Davis E. King,et al.  Dlib-ml: A Machine Learning Toolkit , 2009, J. Mach. Learn. Res..

[37]  Michael J. Lyons,et al.  Coding facial expressions with Gabor wavelets , 1998, Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition.

[38]  Bolei Zhou,et al.  Learning Deep Features for Discriminative Localization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Thomas S. Huang,et al.  Do Deep Neural Networks Learn Facial Action Units When Doing Expression Recognition? , 2015, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW).

[40]  Christof Koch,et al.  A Model of Saliency-Based Visual Attention for Rapid Scene Analysis , 2009 .

[41]  Cees Snoek,et al.  VideoLSTM convolves, attends and flows for action recognition , 2016, Comput. Vis. Image Underst..

[42]  Kihong Park,et al.  Unified multi-spectral pedestrian detection based on probabilistic fusion networks , 2018, Pattern Recognit..

[43]  Qin Jin,et al.  Multi-modal Dimensional Emotion Recognition using Recurrent Neural Networks , 2015, AVEC@ACM Multimedia.

[44]  P. Ekman,et al.  DIFFERENCES Universals and Cultural Differences in the Judgments of Facial Expressions of Emotion , 2004 .

[45]  Gang Wang,et al.  Multimodal Recurrent Neural Networks With Information Transfer Layers for Indoor Scene Labeling , 2018, IEEE Transactions on Multimedia.

[46]  Tamás D. Gedeon,et al.  Collecting Large, Richly Annotated Facial-Expression Databases from Movies , 2012, IEEE MultiMedia.

[47]  Thomas S. Huang,et al.  How deep neural networks can improve emotion recognition on video data , 2016, 2016 IEEE International Conference on Image Processing (ICIP).

[48]  Fei Chen,et al.  A Natural Visible and Infrared Facial Expression Database for Expression Recognition and Emotion Inference , 2010, IEEE Transactions on Multimedia.

[49]  Mohammad Soleymani,et al.  Analysis of EEG Signals and Facial Expressions for Continuous Emotion Detection , 2016, IEEE Transactions on Affective Computing.

[50]  Maja Pantic,et al.  AFEW-VA database for valence and arousal estimation in-the-wild , 2017, Image Vis. Comput..

[51]  Ming-Hsuan Yang,et al.  Weakly Supervised Coupled Networks for Visual Sentiment Analysis , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[52]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[53]  Matti Pietikäinen,et al.  Dynamic Texture Recognition Using Local Binary Patterns with an Application to Facial Expressions , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[54]  Chen Feng,et al.  Near-infrared guided color image dehazing , 2013, 2013 IEEE International Conference on Image Processing.

[55]  Ya Li,et al.  Long Short Term Memory Recurrent Neural Network based Multimodal Dimensional Emotion Recognition , 2015, AVEC@ACM Multimedia.

[56]  Shaun J. Canavan,et al.  BP4D-Spontaneous: a high-resolution spontaneous 3D dynamic facial expression database , 2014, Image Vis. Comput..

[57]  Fabien Ringeval,et al.  Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions , 2013, 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[58]  Frédéric Jurie,et al.  Temporal multimodal fusion for video emotion classification in the wild , 2017, ICMI.

[59]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[60]  Luc Van Gool,et al.  A 3-D Audio-Visual Corpus of Affective Communication , 2010, IEEE Transactions on Multimedia.

[61]  Geoffrey E. Hinton,et al.  Learning to combine foveal glimpses with a third-order Boltzmann machine , 2010, NIPS.

[62]  Daniel McDuff,et al.  Affectiva-MIT Facial Expression Dataset (AM-FED): Naturalistic and Spontaneous Facial Expressions Collected "In-the-Wild" , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[63]  Wen-Jing Yan,et al.  How Fast are the Leaked Facial Expressions: The Duration of Micro-Expressions , 2013 .

[64]  Gwen Littlewort,et al.  Fully Automatic Facial Action Recognition in Spontaneous Behavior , 2006, 7th International Conference on Automatic Face and Gesture Recognition (FGR06).

[65]  Christopher Joseph Pal,et al.  Recurrent Neural Networks for Emotion Recognition in Video , 2015, ICMI.

[66]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[67]  Katherine B. Martin,et al.  Facial Action Coding System , 2015 .

[68]  Seungryong Kim,et al.  Spatiotemporal Attention Based Deep Neural Networks for Emotion Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[69]  Julian Togelius,et al.  Experience-Driven Procedural Content Generation , 2011, IEEE Transactions on Affective Computing.

[70]  Wei Xu,et al.  Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) , 2014, ICLR.

[71]  Fabien Ringeval,et al.  AVEC 2017: Real-life Depression, and Affect Recognition Workshop and Challenge , 2017, AVEC@ACM Multimedia.

[72]  J. Russell,et al.  Evidence for a three-factor theory of emotions , 1977 .

[73]  Margaret McRorie,et al.  The Belfast Induced Natural Emotion Database , 2012, IEEE Transactions on Affective Computing.

[74]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[75]  Sergio Escalera,et al.  Multi-modal RGB–Depth–Thermal Human Body Segmentation , 2016, International Journal of Computer Vision.

[76]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[77]  Fabien Ringeval,et al.  SEWA DB: A Rich Database for Audio-Visual Emotion and Sentiment Research in the Wild , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[78]  Guoying Zhao,et al.  Deep Affect Prediction in-the-Wild: Aff-Wild Database and Challenge, Deep Architectures, and Beyond , 2018, International Journal of Computer Vision.

[79]  Ming Yang,et al.  DeepFace: Closing the Gap to Human-Level Performance in Face Verification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[80]  Gwen Littlewort,et al.  Recognizing facial expression: machine learning and application to spontaneous behavior , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).