Paper Information - Spatiotemporal Contrastive Learning of Facial Expressions in Videos

We propose a self-supervised contrastive learning approach for facial expression recognition (FER) in videos. In addition to the standard spatial augmentations used in contrastive learning, we introduce a novel temporal sampling-based augmentation scheme that randomly selects one of three temporal sampling techniques: (1) pure random sampling, (2) uniform sampling, and (3) sequential sampling. The sampled frames then undergo a combination of up to three standard spatial augmentations. We train a deep R(2+1)D network in a self-supervised fashion on the augmented clips and subsequently fine-tune it for FER. Experiments on the Oulu-CASIA dataset show that our method achieves an accuracy of 89.4%, outperforming prior FER works and setting a new state of the art. Additional experiments and analysis confirm the considerable contribution of the proposed temporal augmentation relative to the existing spatial ones.
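The temporal augmentation named in the abstract can be illustrated with a minimal sketch. The abstract lists the three sampling techniques but does not define them, so the definitions below are assumptions: "pure random" picks distinct frames anywhere in the video, "uniform" picks evenly spaced frames, and "sequential" picks a run of consecutive frames. The function names, the `clip_len` parameter, and the choice to return sorted frame indices are hypothetical and are not taken from the paper.

```python
import random


def random_sampling(num_frames, clip_len, rng):
    """Assumed 'pure random' sampling: distinct frames from anywhere in the
    video, returned in temporal order."""
    return sorted(rng.sample(range(num_frames), clip_len))


def uniform_sampling(num_frames, clip_len, rng):
    """Assumed 'uniform' sampling: evenly spaced frames spanning the video.
    The rng argument is unused; it keeps the three signatures identical."""
    if clip_len == 1:
        return [0]
    # Integer arithmetic keeps indices strictly increasing when
    # num_frames >= clip_len.
    return [(i * (num_frames - 1)) // (clip_len - 1) for i in range(clip_len)]


def sequential_sampling(num_frames, clip_len, rng):
    """Assumed 'sequential' sampling: consecutive frames from a random start."""
    start = rng.randint(0, num_frames - clip_len)  # randint is inclusive
    return list(range(start, start + clip_len))


SAMPLERS = (random_sampling, uniform_sampling, sequential_sampling)


def temporal_augment(num_frames, clip_len, rng=None):
    """Randomly pick one of the three samplers and return frame indices."""
    if not 1 <= clip_len <= num_frames:
        raise ValueError("clip_len must be between 1 and num_frames")
    rng = rng or random.Random()
    sampler = rng.choice(SAMPLERS)
    return sampler(num_frames, clip_len, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    # Two independently sampled index sets from the same 20-frame video,
    # which could serve as the two views of a contrastive pair.
    view_a = temporal_augment(20, 8, rng)
    view_b = temporal_augment(20, 8, rng)
    print(view_a)
    print(view_b)
```

In a full pipeline, the frames selected by each index list would then pass through the spatial augmentations before being fed to the encoder; that step is omitted here because the abstract does not say which spatial augmentations are used.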

Ali Etemad | Shuvendu Roy