Weakly-Supervised Attention and Relation Learning for Facial Action Unit Detection

Attention mechanism has recently attracted increasing attentions in the field of facial action unit (AU) detection. By finding the region of interest (ROI) of each AU with the attention mechanism, AU related local features can be captured. Most existing AU detection works design fixed attentions based on the prior knowledge, without considering the nonrigidity of AUs and relations among AUs. In this paper, we propose a novel end-to-end weakly-supervised attention and relation learning framework for AU detection, which has not been explored before. In particular, to select and extract AU related features, both channel-wise attentions and spatial attentions are learned with AU labels only. Moreover, pixel-level relations for AUs are learned to refine spatial attentions and extract more accurate local features. A multi-scale local region learning method is further proposed to adapt multi-scale AUs in different locations, which can facilitate the weakly-supervised attention and relation learning. Extensive experiments on BP4D and DISFA benchmarks demonstrate that our framework (i) outperforms the state-of-the-art methods for AU detection, and (ii) also achieves superior performance of AU intensity estimation with a simple extension.

[1]  J. Fleiss,et al.  Intraclass correlations: uses in assessing rater reliability. , 1979, Psychological bulletin.

[2]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[3]  P. Ekman,et al.  What the face reveals : basic and applied studies of spontaneous expression using the facial action coding system (FACS) , 2005 .

[4]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[5]  Takeo Kanade,et al.  The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops.

[6]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[7]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[8]  Stefanos Zafeiriou,et al.  Markov Random Field Structures for Facial Action Unit Intensity Estimation , 2013, 2013 IEEE International Conference on Computer Vision Workshops.

[9]  Shiguang Shan,et al.  AU-aware Deep Networks for facial expression recognition , 2013, 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[10]  Mohammad H. Mahoor,et al.  DISFA: A Spontaneous Facial Action Intensity Database , 2013, IEEE Transactions on Affective Computing.

[11]  Ivor W. Tsang,et al.  Feature Disentangling Machine - A Novel Approach of Feature Selection and Disentangling in Facial Expression Analysis , 2014, ECCV.

[12]  Qiang Chen,et al.  Network In Network , 2013, ICLR.

[13]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[14]  Ming Yang,et al.  DeepFace: Closing the Gap to Human-Level Performance in Face Verification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Shaun J. Canavan,et al.  BP4D-Spontaneous: a high-resolution spontaneous 3D dynamic facial expression database , 2014, Image Vis. Comput..

[16]  Lijun Yin,et al.  FERA 2015 - second Facial Expression Recognition and Analysis challenge , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[17]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[18]  Yuxin Peng,et al.  The application of two-level attention models in deep convolutional neural network for fine-grained image classification , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Maja Pantic,et al.  Multi-conditional Latent Variable Model for Joint Facial Action Unit Detection , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[20]  Vibhav Vineet,et al.  Conditional Random Fields as Recurrent Neural Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[21]  Maja Pantic,et al.  Latent trees for estimating intensity of Facial Action Units , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Qingshan Liu,et al.  Learning Multiscale Active Facial Patches for Expression Analysis , 2015, IEEE Transactions on Cybernetics.

[23]  Honggang Zhang,et al.  Deep Region and Multi-label Learning for Facial Action Unit Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Gang Hua,et al.  Ordinal Regression with Multiple Output CNN for Age Estimation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Mohammad H. Mahoor,et al.  Task-dependent multi-task multiple kernel learning for facial action unit detection , 2016, Pattern Recognit..

[27]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[28]  Gang Wang,et al.  Recurrent Attentional Networks for Saliency Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Honggang Zhang,et al.  Joint Patch and Multi-label Learning for Facial Action Unit and Holistic Expression Recognition , 2016, IEEE Transactions on Image Processing.

[30]  Jiebo Luo,et al.  Image Captioning with Semantic Attention , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Qiang Ji,et al.  Constrained Joint Cascade Regression Framework for Simultaneous Facial Action Unit Recognition and Facial Landmark Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Vladimir Pavlovic,et al.  Deep Structured Learning for Facial Action Unit Intensity Estimation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Fernando De la Torre,et al.  Learning Spatial and Temporal Cues for Multi-Label Facial Action Unit Detection , 2017, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[34]  Zhigang Zhu,et al.  Action Unit Detection with Region Adaptation, Multi-labeling Learning and Optimal Temporal Fusing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Sergio Escalera,et al.  Deep Structure Inference Network for Facial Action Unit Recognition , 2018, ECCV.

[36]  Quanshi Zhang,et al.  Interpretable Convolutional Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[37]  Michel F. Valstar,et al.  Joint Action Unit localisation and intensity estimation through heatmap regression , 2018, BMVC.

[38]  Lijun Yin,et al.  EAC-Net: Deep Nets with Enhancing and Cropping for Facial Action Unit Detection , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  Maja Pantic,et al.  Automatic Analysis of Facial Actions: A Survey , 2019, IEEE Transactions on Affective Computing.

[41]  Vladimir Pavlovic,et al.  Copula Ordinal Regression Framework for Joint Estimation of Facial Action Unit Intensity , 2019, IEEE Transactions on Affective Computing.