Action Unit Detection with Region Adaptation, Multi-labeling Learning and Optimal Temporal Fusing

Action unit (AU) detection has become essential for facial analysis. Many existing approaches struggle to align different face regions, to fuse temporal information effectively, and to train a single model for multiple AU labels. To address these problems, we propose a deep learning framework for AU detection with region of interest (ROI) adaptation, integrated multi-label learning, and optimal LSTM-based temporal fusing. First, an ROI cropping net is designed so that specific regions of interest on the face are learned independently: each sub-region has a local convolutional neural network (CNN) whose convolutional filters are trained only for that region. Second, multi-label learning integrates the outputs of the individual ROI cropping nets, learning the inter-relationships among AUs and acquiring global features across sub-regions for AU detection. Finally, an optimal selection among multiple LSTM layers is carried out to best fuse temporal features and make the AU prediction as accurate as possible. The proposed approach is evaluated on two popular AU detection datasets, BP4D and DISFA, and significantly outperforms the state of the art, with average improvements of around 13% on BP4D and 25% on DISFA.
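
The first two stages can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`crop_roi`, `multi_label_bce`), the toy feature map, the landmark positions, and the patch size are all assumptions chosen for illustration. The sketch crops a fixed-size patch around each landmark, which is roughly what an ROI cropping layer would pass to its local CNN, and computes a mean sigmoid cross-entropy over independent AU labels, one common way to set up multi-label training. The LSTM-based temporal fusion is omitted.

```python
import math


def crop_roi(feature_map, center, size):
    """Crop a size x size patch around (row, col), shifting the window to stay in bounds."""
    rows, cols = len(feature_map), len(feature_map[0])
    half = size // 2
    top = min(max(center[0] - half, 0), rows - size)
    left = min(max(center[1] - half, 0), cols - size)
    return [row[left:left + size] for row in feature_map[top:top + size]]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def multi_label_bce(logits, labels):
    """Mean binary cross-entropy over AU labels, each treated as its own binary target."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(logits)


# Toy 6x6 feature map whose entries encode their position (10 * row + col).
feature_map = [[10 * r + c for c in range(6)] for r in range(6)]

# Hypothetical landmark positions (row, col); a real system would get these
# from a face alignment step.
landmarks = [(1, 1), (4, 5)]

# One patch per landmark; each patch would feed its own region-specific CNN.
patches = [crop_roi(feature_map, lm, 3) for lm in landmarks]

# Two AUs with uninformative logits: one present, one absent.
loss = multi_label_bce([0.0, 0.0], [1, 0])
```

The second landmark sits on the border of the map, so its window is shifted inward rather than padded. That is one possible design choice among several, and the abstract does not say how border regions are handled.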
