EAC-Net: Deep Nets with Enhancing and Cropping for Facial Action Unit Detection

In this paper, we propose a deep learning based approach for facial action unit (AU) detection by enhancing and cropping regions of interest of face images. The approach is implemented by adding two novel nets (a.k.a. layers): the enhancing layers and the cropping layers, to a pretrained convolutional neural network (CNN) model. For the enhancing layers (noted as E-Net), we have designed an attention map based on facial landmark features and apply it to a pretrained neural network to conduct enhanced learning. For the cropping layers (noted as C-Net ), we crop facial regions around the detected landmarks and design individual convolutional layers to learn deeper features for each facial region. We then combine the E-Net and the C-Net to construct a so-called Enhancing and Cropping Net (EAC-Net), which can learn both features enhancing and region cropping functions effectively. The EAC-Net integrates three important elements, i.e., learning transfer, attention coding, and regions of interest processing, making our AU detection approach more efficient and more robust to facial position and orientation changes. Our approach shows a significant performance improvement over the state-of-the-art methods when tested on the BP4D and DISFA AU datasets. The EAC-Net with a slight modification also shows its potentials in estimating accurate AU intensities. We have also studied the performance of the proposed EAC-Net under two very challenging conditions: (1) faces with partial occlusion and (2) faces with large head pose variations. Experimental results show that (1) the EAC-Net learns facial AUs correlation effectively and predicts AUs reliably even with only half of a face being visible, especially for the lower half; (2) Our EAC-Net model also works well under very large head poses, which outperforms significantly a compared baseline approach. It further shows that the EAC-Net works much better without a face frontalization than with face frontalization through image warping as pre-processing, in terms of computational efficiency and AU detection accuracy.

[1]  Simon Lucey,et al.  How much training data for facial action unit detection? , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[2]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[3]  Fernando De la Torre,et al.  Selective Transfer Machine for Personalized Facial Action Unit Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Maja Pantic,et al.  Fully Automatic Facial Action Unit Detection and Temporal Analysis , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[5]  Qiang Ji,et al.  Constrained Joint Cascade Regression Framework for Simultaneous Facial Action Unit Recognition and Facial Landmark Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Qiang Ji,et al.  Capturing Global Semantic Relationships for Facial Action Unit Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[7]  Peter Robinson,et al.  Cross-dataset learning and person-specific normalisation for automatic Action Unit detection , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[8]  Michel F. Valstar,et al.  Deep learning the dynamic appearance and shape of facial action units , 2016, 2016 IEEE Winter Conference on Applications of Computer Vision (WACV).

[9]  Aleix M. Martínez,et al.  EmotioNet: An Accurate, Real-Time Algorithm for the Automatic Annotation of a Million Facial Expressions in the Wild , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Kristy Elizabeth Boyer,et al.  Automatically Recognizing Facial Expression: Predicting Engagement and Frustration , 2013, EDM.

[11]  H. Emrah Tasli,et al.  Deep learning based FACS Action Unit occurrence and intensity estimation , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[12]  Takeo Kanade,et al.  The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops.

[13]  Alexander J. Smola,et al.  Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Ming Yang,et al.  DeepFace: Closing the Gap to Human-Level Performance in Face Verification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Zhang Xiong,et al.  Confidence Preserving Machine for Facial Action Unit Detection , 2016, IEEE Trans. Image Process..

[16]  Josephine Sullivan,et al.  One millisecond face alignment with an ensemble of regression trees , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Shiguang Shan,et al.  AU-aware Deep Networks for facial expression recognition , 2013, 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[18]  Honggang Zhang,et al.  Joint patch and multi-label learning for facial action unit detection , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Maja Pantic,et al.  A Dynamic Texture-Based Approach to Recognition of Facial Actions and Their Temporal Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  P. Ekman,et al.  What the face reveals : basic and applied studies of spontaneous expression using the facial action coding system (FACS) , 2005 .

[21]  Honggang Zhang,et al.  Deep Region and Multi-label Learning for Facial Action Unit Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Lijun Yin,et al.  FERA 2015 - second Facial Expression Recognition and Analysis challenge , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[23]  Shuo Yang,et al.  From Facial Parts Responses to Face Detection: A Deep Learning Approach , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[24]  Shaun J. Canavan,et al.  BP4D-Spontaneous: a high-resolution spontaneous 3D dynamic facial expression database , 2014, Image Vis. Comput..

[25]  Mohamed Chetouani,et al.  Facial Action Unit intensity prediction via Hard Multi-Task Metric Learning for Kernel Regression , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[26]  Jian Sun,et al.  Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Fernando De la Torre,et al.  Facial Action Unit Event Detection by Cascade of Tasks , 2013, 2013 IEEE International Conference on Computer Vision.

[28]  Xiaogang Wang,et al.  Saliency detection by multi-context deep learning , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Shaun J. Canavan,et al.  Multimodal Spontaneous Emotion Corpus for Human Behavior Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Zheng Zhang,et al.  FERA 2017 - Addressing Head Pose in the Third Facial Expression Recognition and Analysis Challenge , 2017, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[32]  Mohammad H. Mahoor,et al.  DISFA: A Spontaneous Facial Action Intensity Database , 2013, IEEE Transactions on Affective Computing.

[33]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Rainer Stiefelhagen,et al.  Action unit intensity estimation using hierarchical partial least squares , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[35]  Maja Pantic,et al.  Multi-conditional Latent Variable Model for Joint Facial Action Unit Detection , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[36]  Jun Wang,et al.  A 3D facial expression database for facial behavior research , 2006, 7th International Conference on Automatic Face and Gesture Recognition (FGR06).

[37]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[38]  Daniel McDuff,et al.  Exploiting sparsity and co-occurrence structure for action unit recognition , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[39]  Zhigang Zhu,et al.  A Deep Feature based Multi-kernel Learning Approach for Video Emotion Recognition , 2015, ICMI.

[40]  Lijun Yin,et al.  Multi-view facial expression recognition , 2008, 2008 8th IEEE International Conference on Automatic Face & Gesture Recognition.