MVT: Mask Vision Transformer for Facial Expression Recognition in the wild

Facial Expression Recognition (FER) in the wild is an extremely challenging task in computer vision due to varying backgrounds, low-quality facial images, and the subjectivity of annotators. These uncertainties make it difficult for neural networks to learn robust features on limited-scale datasets. Moreover, networks are easily distracted by these factors and make incorrect decisions. Recently, the vision transformer (ViT) and data-efficient image transformers (DeiT) have shown strong performance on traditional classification tasks. The self-attention mechanism gives transformers a global receptive field from the first layer, which dramatically enhances their feature extraction capability. In this work, we propose, for the first time, a pure transformer-based mask vision transformer (MVT) for FER in the wild, which consists of two modules: a transformer-based mask generation network (MGN) that generates a mask to filter out complex backgrounds and occlusions in face images, and a dynamic relabeling module that rectifies incorrect labels in in-the-wild FER datasets. Extensive experimental results demonstrate that our MVT outperforms state-of-the-art methods on RAF-DB (88.62%), FERPlus (89.22%), and AffectNet-7 (64.57%), and achieves a comparable result on AffectNet-8 (61.40%).
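As a rough illustration of two ideas in the abstract, the following NumPy sketch shows single-head self-attention over patch tokens (every token attends to every other token in one layer, which is what "global receptive field" refers to) and a binary patch mask applied before attention. It is a minimal sketch under stated assumptions, not the MVT architecture: the patch count, embedding size, random projection weights, and the randomly drawn `patch_mask` are all hypothetical placeholders. In the paper the mask comes from the learned MGN, whose details are not given here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens: (num_patches, dim) array of patch embeddings.
    Returns the attended tokens and the (num_patches, num_patches) weights.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)

# Assumed sizes for illustration only: 196 patches, 64-dim embeddings.
num_patches, dim = 196, 64
tokens = rng.normal(size=(num_patches, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

# Hypothetical binary mask (1 = keep patch, 0 = suppress patch).
# A random stand-in for what a mask generation network would predict.
patch_mask = (rng.random(num_patches) > 0.3).astype(tokens.dtype)
masked_tokens = tokens * patch_mask[:, None]

out, weights = self_attention(masked_tokens, w_q, w_k, w_v)

# Each row of `weights` is a distribution over all patches, so every
# output token mixes information from the whole image in a single layer.
print(out.shape, weights.shape)
```

Multiplying tokens by the mask is only one simple way to suppress patches; it is used here because it keeps the example short, not because the source specifies it.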
