Facial Expression Recognition With Visual Transformers and Attentional Selective Fusion

Facial expression recognition (FER) in the wild is extremely challenging due to occlusions, varying head poses, face deformation and motion blur under unconstrained conditions. Although substantial progress has been made in automatic FER over the past few decades, previous studies were mainly designed for lab-controlled FER. Real-world occlusions, varying head poses and other issues increase the difficulty of FER because they introduce information-deficient regions and complex backgrounds. Unlike previous purely CNN-based methods, we argue that it is feasible and practical to translate facial images into sequences of visual words and perform expression recognition from a global perspective. Therefore, we propose Visual Transformers with Feature Fusion (VTFF) to tackle FER in the wild in two main steps. First, we propose attentional selective fusion (ASF) to leverage two kinds of feature maps generated by two-branch CNNs. The ASF captures discriminative information by fusing multiple features with global-local attention. The fused feature maps are then flattened and projected into sequences of visual words. Second, inspired by the success of Transformers in natural language processing, we propose to model relationships between these visual words with global self-attention. The proposed method is evaluated on three public in-the-wild facial expression datasets (RAF-DB, FERPlus and AffectNet). Under the same settings, extensive experiments demonstrate that our method outperforms other methods, setting a new state of the art on RAF-DB with 88.14%, FERPlus with 88.81% and AffectNet with 61.85%. The cross-dataset evaluation on CK+ shows the promising generalization capability of the proposed method.
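The two steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the gating formula in `selective_fusion`, the random weights, the single attention head and all tensor sizes are assumptions chosen only to show the data flow (fuse two feature maps with a global-local gate, flatten and project them into visual words, then relate the words with self-attention).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def selective_fusion(x, y, w_local, w_global):
    """Fuse two (C, H, W) feature maps with a gate in (0, 1).

    Assumed form: the gate combines a per-position (local) term and a
    globally pooled term computed from the sum of the two inputs.
    """
    s = x + y
    local = np.einsum("oc,chw->ohw", w_local, s)                # 1x1 conv per position
    global_ = (w_global @ s.mean(axis=(1, 2)))[:, None, None]   # pooled context, broadcast
    gate = sigmoid(local + global_)
    return gate * x + (1.0 - gate) * y

def to_visual_words(fmap, w_proj):
    """Flatten a (C, H, W) map to H*W tokens and project each to D dimensions."""
    c, h, w = fmap.shape
    tokens = fmap.reshape(c, h * w).T   # (H*W, C)
    return tokens @ w_proj              # (H*W, D)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)     # each row sums to 1
    return attn @ v, attn

# Hypothetical sizes: 8 channels, a 4x4 spatial grid, 16-dimensional visual words.
rng = np.random.default_rng(0)
C, H, W, D = 8, 4, 4, 16
x = rng.standard_normal((C, H, W))      # stand-in for the first CNN branch
y = rng.standard_normal((C, H, W))      # stand-in for the second CNN branch
w_local = 0.1 * rng.standard_normal((C, C))
w_global = 0.1 * rng.standard_normal((C, C))
w_proj = 0.1 * rng.standard_normal((C, D))
w_q, w_k, w_v = (0.1 * rng.standard_normal((D, D)) for _ in range(3))

fused = selective_fusion(x, y, w_local, w_global)
words = to_visual_words(fused, w_proj)
out, attn = self_attention(words, w_q, w_k, w_v)
print(fused.shape, words.shape, out.shape)
```

Because the gate lies strictly between 0 and 1, each fused value is a convex combination of the two branch values at that position. Every one of the 16 visual words attends to all the others, which is the sense in which the recognition is performed from a global perspective.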
