Facial expression recognition with grid-wise attention and visual transformer

Abstract: Facial Expression Recognition (FER) has achieved remarkable progress as a result of using Convolutional Neural Networks (CNNs). Because convolutional filters rely on spatial locality, however, they fail to learn long-range inductive biases between different facial regions in most neural layers. As such, the performance of a CNN-based model for FER is still limited. To address this problem, this paper introduces a novel FER framework with two attention mechanisms for CNN-based models, used for low-level feature learning and high-level semantic representation, respectively. In low-level feature learning, a grid-wise attention mechanism is proposed to capture the dependencies between different regions of a facial expression image, so that the parameter updates of the convolutional filters in low-level feature learning are regularized. In high-level semantic representation, a visual transformer attention mechanism uses a sequence of visual semantic tokens (generated from pyramid features of the high convolutional layer blocks) to learn the global representation. Extensive experiments have been conducted on three public facial expression datasets: CK+, FER+, and RAF-DB. The results show that the proposed framework, FER-VT, achieves state-of-the-art performance on these datasets, including 100% accuracy on CK+ without any extra training data.
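To make the idea of attention between facial regions concrete, the following is a minimal sketch and not the FER-VT implementation. It assumes a grayscale image stored as a list of rows, splits it into a regular grid of patches, and applies plain scaled dot-product self-attention across the patches with no learned projections. The function names `split_into_grids` and `self_attention` are hypothetical and chosen only for illustration; the abstract does not specify how the paper's grid-wise attention is computed.

```python
import math

def split_into_grids(image, grid):
    """Split an H x W image (list of rows) into grid*grid flattened patches."""
    h, w = len(image), len(image[0])
    if h % grid or w % grid:
        raise ValueError("image height and width must be divisible by grid")
    ph, pw = h // grid, w // grid
    tokens = []
    for gy in range(grid):
        for gx in range(grid):
            # Flatten one patch row by row into a single token vector.
            patch = [image[y][x]
                     for y in range(gy * ph, (gy + 1) * ph)
                     for x in range(gx * pw, (gx + 1) * pw)]
            tokens.append(patch)
    return tokens

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q, K, V (an assumption)."""
    d = len(tokens[0])
    scale = math.sqrt(d)
    weights = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights.append(softmax(scores))
    # Each output token is a weighted mix of all grid tokens, so every
    # region can draw on every other region regardless of distance.
    mixed = [[sum(w[j] * tokens[j][c] for j in range(len(tokens)))
              for c in range(d)]
             for w in weights]
    return weights, mixed

# Toy 4 x 4 "image" split into a 2 x 2 grid of 2 x 2 patches.
image = [[float(4 * y + x) for x in range(4)] for y in range(4)]
tokens = split_into_grids(image, 2)
weights, mixed = self_attention(tokens)
```

Each row of `weights` describes how strongly one grid cell attends to every other cell. In a trained model the queries, keys, and values would come from learned projections rather than from raw pixels.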
