Paper information - POSTER V2: A simpler and stronger facial expression recognition network

Facial expression recognition (FER) plays an important role in a variety of real-world applications such as human-computer interaction. POSTER achieves state-of-the-art (SOTA) performance in FER by effectively combining facial landmark and image features through a two-stream pyramid cross-fusion design. However, the architecture of POSTER is complex, which makes it computationally expensive. To relieve this computational pressure, in this paper we propose POSTER++. It improves POSTER in three directions: cross-fusion, two-stream design, and multi-scale feature extraction. In cross-fusion, we replace the vanilla cross-attention mechanism with a window-based cross-attention mechanism. In the two-stream design, we remove the image-to-landmark branch. For multi-scale feature extraction, POSTER++ combines image features with multi-scale landmark features, replacing POSTER's pyramid design. Extensive experiments on several standard datasets show that POSTER++ achieves SOTA FER performance with the minimum computational cost. For example, POSTER++ reaches 92.21% on RAF-DB, 67.49% on AffectNet (7 cls), and 63.77% on AffectNet (8 cls), using only 8.4G floating point operations (FLOPs) and 43.7M parameters (Param). This demonstrates the effectiveness of our improvements.
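To illustrate why replacing vanilla cross-attention with a window-based variant can lower the computational cost, the sketch below compares the multiply-accumulate counts of the two schemes using the standard complexity estimates for global and windowed attention. It is not taken from the POSTER++ implementation: the function names, the feature-map size (28x28), the channel width (64), and the window size (7) are all hypothetical values chosen for illustration, and the sketch assumes the query and key/value feature maps share the same spatial size.

```python
def global_attention_cost(h: int, w: int, c: int) -> int:
    """Approximate multiply-accumulates for attention over all h*w tokens."""
    n = h * w
    projections = 4 * n * c * c   # Q, K, V and output linear projections
    attention = 2 * n * n * c     # Q @ K^T, then attention weights @ V
    return projections + attention


def window_attention_cost(h: int, w: int, c: int, m: int) -> int:
    """Approximate multiply-accumulates when attention is limited to m x m windows."""
    if h % m != 0 or w % m != 0:
        raise ValueError("feature map must divide evenly into m x m windows")
    n = h * w
    projections = 4 * n * c * c   # projections are unchanged by windowing
    attention = 2 * (m * m) * n * c  # each token attends to m*m tokens only
    return projections + attention


if __name__ == "__main__":
    # Hypothetical sizes for illustration; not the paper's configuration.
    h, w, c, m = 28, 28, 64, 7
    full = global_attention_cost(h, w, c)
    windowed = window_attention_cost(h, w, c, m)
    print(f"global attention:   {full:,} MACs")
    print(f"windowed attention: {windowed:,} MACs")
    print(f"ratio: {full / windowed:.2f}x")
```

The attention term grows quadratically with the number of tokens in the global case but only linearly in the windowed case, so the saving increases with the feature-map resolution. When the window covers the whole feature map, the two estimates coincide.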
