MFEViT: A Robust Lightweight Transformer-based Network for Multimodal 2D+3D Facial Expression Recognition

The vision transformer (ViT) has been widely applied in many areas because its self-attention mechanism provides a global receptive field from the first layer onward. It even outperforms CNNs in some vision tasks. However, applying ViT to 2D+3D facial expression recognition (FER) raises an issue: ViT training requires large amounts of data, while the number of samples in public 2D+3D FER datasets is far from sufficient for evaluation. How to use a ViT pre-trained on RGB images to handle 2D+3D data therefore becomes a challenge. To solve this problem, we propose MFEViT, a robust, lightweight, pure transformer-based network for multimodal 2D+3D FER. To narrow the gap between RGB and multimodal data, we devise an alternative fusion strategy, which replaces each of the three channels of an RGB image with the depth-map channel and fuses the resulting images before feeding them into the transformer encoder. Moreover, the designed sample filtering module adds several subclasses for each expression and moves the noisy samples to their corresponding subclasses, thus eliminating their disturbance to the network during the training stage. Extensive experiments demonstrate that MFEViT outperforms state-of-the-art approaches, with an accuracy of 90.83% on BU-3DFE and 90.28% on Bosphorus. In addition, the proposed MFEViT is a lightweight model that requires far fewer parameters than multi-branch CNNs. To the best of our knowledge, this is the first work to introduce the vision transformer into multimodal 2D+3D FER. The source code of our MFEViT will be publicly available online.
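
The channel-replacement step of the fusion strategy can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `channel_replaced_variants` and `fuse_by_mean` are hypothetical, the 224 x 224 input size is assumed, and the element-wise mean is only a placeholder for the fusion operator, which the abstract does not specify.

```python
import numpy as np


def channel_replaced_variants(rgb, depth):
    """Build three H x W x 3 images, each with one RGB channel swapped for depth.

    rgb:   float array of shape (H, W, 3)
    depth: float array of shape (H, W), the depth map aligned with rgb
    """
    variants = []
    for c in range(3):
        v = rgb.copy()        # keep the other two colour channels unchanged
        v[..., c] = depth     # replace channel c with the depth map
        variants.append(v)
    return variants


def fuse_by_mean(variants):
    """Placeholder fusion (assumption): element-wise mean of the variants."""
    return np.mean(np.stack(variants, axis=0), axis=0)


# Synthetic inputs standing in for an aligned 2D texture image and depth map.
rgb = np.random.default_rng(0).random((224, 224, 3)).astype(np.float32)
depth = np.random.default_rng(1).random((224, 224)).astype(np.float32)

variants = channel_replaced_variants(rgb, depth)
fused = fuse_by_mean(variants)   # 3-channel input for an RGB-pretrained encoder
```

Each variant keeps the three-channel layout expected by a ViT pre-trained on RGB images, which is the property the strategy relies on; how the variants are actually combined in MFEViT should be taken from the paper rather than from this sketch.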
