POSTER V2: A simpler and stronger facial expression recognition network

Facial expression recognition (FER) plays an important role in a variety of real-world applications such as human-computer interaction. POSTER achieves state-of-the-art (SOTA) performance in FER by effectively combining facial landmark and image features through a two-stream pyramid cross-fusion design. However, the architecture of POSTER is complex, which makes it computationally expensive. To relieve this computational burden, in this paper we propose POSTER++. It improves POSTER in three directions: cross-fusion, the two-stream design, and multi-scale feature extraction. In cross-fusion, we replace the vanilla cross-attention mechanism with a window-based cross-attention mechanism. In the two-stream design, we remove the image-to-landmark branch. For multi-scale feature extraction, POSTER++ combines image features with the landmarks' multi-scale features to replace POSTER's pyramid design. Extensive experiments on several standard datasets show that POSTER++ achieves SOTA FER performance with the minimum computational cost. For example, POSTER++ reaches 92.21% on RAF-DB, 67.49% on AffectNet (7 cls), and 63.77% on AffectNet (8 cls), using only 8.4G floating point operations (FLOPs) and 43.7M parameters (Param). This demonstrates the effectiveness of our improvements.
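To make the window-based cross-attention idea concrete, the following is a minimal NumPy sketch, not the POSTER++ implementation. It assumes that landmark features supply the queries and image features supply the keys and values, that both feature maps share the same spatial size, and that windows are non-overlapping squares. The learned projections and multi-head structure of a real attention layer are omitted, and the function names and shapes are hypothetical.

```python
import numpy as np

def window_partition(x, window):
    """Split an (H, W, C) map into non-overlapping windows of window*window tokens."""
    H, W, C = x.shape
    x = x.reshape(H // window, window, W // window, window, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows of windows, cols of windows, window, window, C)
    return x.reshape(-1, window * window, C)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def window_cross_attention(landmark_feat, image_feat, window):
    """Cross-attention restricted to each window (assumed roles: Q=landmark, K=V=image)."""
    q = window_partition(landmark_feat, window)  # (num_windows, N, C)
    k = window_partition(image_feat, window)     # (num_windows, N, C)
    v = k                                        # no learned projections in this sketch
    scale = q.shape[-1] ** -0.5
    attn = softmax(q @ k.transpose(0, 2, 1) * scale)  # (num_windows, N, N)
    return attn @ v                                   # (num_windows, N, C)

rng = np.random.default_rng(0)
landmark = rng.standard_normal((8, 8, 16))  # hypothetical 8x8 map with 16 channels
image = rng.standard_normal((8, 8, 16))
out = window_cross_attention(landmark, image, window=4)

# Attention-score entries: global attention over all 64 tokens vs. four 16-token windows.
global_scores = (8 * 8) ** 2
windowed_scores = out.shape[0] * out.shape[1] ** 2
```

In this toy setting the windowed variant computes a quarter of the attention scores that global cross-attention would, which illustrates why restricting attention to windows can lower cost. The actual savings in POSTER++ depend on its feature-map and window sizes, which the abstract does not state.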
