Self-Supervised Learning for Facial Action Unit Recognition through Temporal Consistency

Facial expressions have inherent temporal dependencies that can be leveraged in automatic facial expression analysis from videos. In this paper, we propose a self-supervised representation learning method for facial Action Unit (AU) recognition that exploits temporal consistencies in videos. To this end, we use a triplet-based ranking approach that learns to rank frames by their temporal distance from an anchor frame. Instead of manually labeling informative triplets, we randomly select an anchor frame along with two additional frames at predefined distances from the anchor as the positive and negative samples. To develop an effective metric learning approach, we introduce an aggregate ranking loss, defined as the sum of multiple triplet losses, which allows pairwise comparisons between adjacent frames. A Convolutional Neural Network (CNN) is used as the encoder, trained by minimizing this objective. We demonstrate that our encoder learns representations that are meaningful for AU recognition without using any labels. The encoder is evaluated for AU detection on several datasets, including BP4D, EmotioNet, and DISFA. Our results are comparable or superior to the state of the art in self-supervised AU recognition.
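The aggregate ranking loss described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: it assumes squared Euclidean distances between embeddings, a fixed margin, and frames already sorted by increasing temporal distance from the anchor; all function names and values are illustrative.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, pos, neg, margin=0.2):
    """Standard triplet hinge loss: pull the positive closer to the
    anchor than the negative, by at least `margin`."""
    return max(0.0, margin + sq_dist(anchor, pos) - sq_dist(anchor, neg))


def aggregate_ranking_loss(anchor, frames, margin=0.2):
    """Sum of triplet losses over adjacent frame pairs.

    `frames` holds embeddings ordered by increasing temporal distance
    from the anchor; each adjacent pair (frames[i], frames[i+1]) forms
    one (positive, negative) triplet with the anchor, so minimizing the
    sum encourages embedding distance to rank with temporal distance.
    """
    return sum(
        triplet_loss(anchor, frames[i], frames[i + 1], margin)
        for i in range(len(frames) - 1)
    )
```

In this sketch, a zero loss means the embedding distances already respect the temporal ordering of the sampled frames; any adjacent pair whose distances are out of order (within the margin) contributes a positive term.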
