Former-DFER: Dynamic Facial Expression Recognition Transformer

This paper proposes a dynamic facial expression recognition transformer (Former-DFER) for the in-the-wild scenario. Former-DFER consists mainly of a convolutional spatial transformer (CS-Former) and a temporal transformer (T-Former). The CS-Former comprises five convolution blocks and N spatial encoders, and is designed to guide the network to learn occlusion- and pose-robust facial features from the spatial perspective. The T-Former comprises M temporal encoders, and is designed to let the network learn contextual facial features from the temporal perspective. Heatmaps of the learned facial features demonstrate that Former-DFER can handle issues such as occlusion, non-frontal pose, and head motion, and a visualization of the feature distribution shows that the method learns more discriminative facial features. Moreover, Former-DFER achieves state-of-the-art results on the DFEW and AFEW benchmarks.
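
The two-stage layout described above can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation: the class name FormerDFERSketch, the channel widths, strides, head count, mean pooling, the seven-class output, and the omission of positional embeddings are all assumptions chosen to keep the example small. Only the overall structure (five convolution blocks, N spatial encoders, M temporal encoders) comes from the abstract.

```python
import torch
import torch.nn as nn


class FormerDFERSketch(nn.Module):
    """Illustrative only: sizes, pooling, and class count are assumptions."""

    def __init__(self, num_classes=7, dim=64, n_spatial=2, m_temporal=2, heads=4):
        super().__init__()
        # Five convolution blocks (conv + BN + ReLU); widths and strides are assumed.
        chans = [3, 16, 32, 32, 64, dim]
        strides = [2, 2, 2, 1, 1]
        self.conv_blocks = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1], 3, stride=strides[i], padding=1),
                nn.BatchNorm2d(chans[i + 1]),
                nn.ReLU(inplace=True),
            )
            for i in range(5)
        ])
        # N spatial encoders: self-attention over positions within one frame.
        self.spatial = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, dim * 2, batch_first=True),
            num_layers=n_spatial,
        )
        # M temporal encoders: self-attention over the per-frame embeddings.
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, dim * 2, batch_first=True),
            num_layers=m_temporal,
        )
        self.head = nn.Linear(dim, num_classes)

    def forward(self, clip):
        # clip: (batch, frames, 3, height, width)
        b, t, c, h, w = clip.shape
        x = self.conv_blocks(clip.reshape(b * t, c, h, w))   # (b*t, dim, h', w')
        x = x.flatten(2).transpose(1, 2)                     # (b*t, h'*w', dim)
        x = self.spatial(x).mean(dim=1)                      # (b*t, dim), one vector per frame
        x = self.temporal(x.reshape(b, t, -1)).mean(dim=1)   # (b, dim), one vector per clip
        return self.head(x)                                  # (b, num_classes) logits


if __name__ == "__main__":
    model = FormerDFERSketch()
    logits = model(torch.randn(2, 4, 3, 32, 32))  # 2 clips of 4 frames each
    print(logits.shape)
```

In this sketch the spatial stage treats every frame independently, so attention can weight facial regions within a frame, while the temporal stage only sees one embedding per frame and models how those embeddings relate across the clip.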
