论文信息 - Rotation Axis Focused Attention Network (RAFA-Net) for Estimating Head Pose

Rotation Axis Focused Attention Network (RAFA-Net) for Estimating Head Pose

Head pose is a vital indicator of human attention and behavior. Therefore, automatic estimation of head pose from images is key to many applications. In this paper, we propose a novel approach for head pose estimation from a single RGB image. Many existing approaches often predict head poses by localizing facial landmarks and then solve 2D to 3D correspondence problem with a mean head model. Such approaches rely entirely on the landmark detection accuracy, an ad-hoc alignment step, and the extraneous head model. To address this drawback, we present an end-to-end deep network, which explores rotation axis (yaw, pitch and roll) focused innovative attention mechanism to capture the subtle changes in images. The mechanism uses attentional spatial pooling from a self-attention layer and learns the importance over fine-grained to coarse spatial structures and combine them to capture rich semantic information concerning a given rotation axis. The evaluation of our approach using three benchmark datasets is very competitive to state-of-the-arts, including with and without landmark-based methods. Code can be found at https://github.com/ArdhenduBehera/RAFA-Net.

[1] Xiaoou Tang,et al. Facial Landmark Detection by Deep Multi-task Learning , 2014, ECCV.

[2] Stefanos Zafeiriou,et al. 300 Faces in-the-Wild Challenge: The First Facial Landmark Localization Challenge , 2013, 2013 IEEE International Conference on Computer Vision Workshops.

[3] Stefanos Zafeiriou,et al. A Comprehensive Performance Evaluation of Deformable Face Tracking “In-the-Wild” , 2016, International Journal of Computer Vision.

[4] Han Zhang,et al. Self-Attention Generative Adversarial Networks , 2018, ICML.

[5] Jian Sun,et al. Face Alignment by Explicit Shape Regression , 2012, International Journal of Computer Vision.

[6] Jan Kautz,et al. Dynamic Facial Analysis: From Bayesian Filtering to Recurrent Neural Network , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Cheng Cheng,et al. A Deep Regression Architecture with Two-Stage Re-initialization for High Performance Facial Landmark Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Jean-Marc Odobez,et al. Robust and Accurate 3D Head Pose Estimation through 3DMM and Online Head Model Reconstruction , 2017, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[9] Harm de Vries,et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. , 2015 .

[10] Andrew Zisserman,et al. Video Action Transformer Network , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11] Luc Van Gool,et al. Random Forests for Real Time 3D Face Analysis , 2012, International Journal of Computer Vision.

[12] Stephen Lin,et al. Deformable ConvNets V2: More Deformable, Better Results , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Pi-Cheng Hsiu,et al. SSR-Net: A Compact Soft Stagewise Regression Network for Age Estimation , 2018, IJCAI.

[14] James M. Rehg,et al. Fine-Grained Head Pose Estimation Without Keypoints , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[15] Jian Sun,et al. Face Alignment Via Component-Based Discriminative Search , 2008, ECCV.

[16] Ramakant Nevatia,et al. FacePoseNet: Making a Case for Landmark-Free Face Alignment , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[17] Simao Herdade,et al. Image Captioning: Transforming Objects into Words , 2019, NeurIPS.

[18] Simon Baker,et al. Active Appearance Models Revisited , 2004, International Journal of Computer Vision.

[19] Takayuki Okatani,et al. Improving Head Pose Estimation with a Combined Loss and Bounding Box Margin Adjustment , 2019, 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019).

[20] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[21] Zhuowen Tu,et al. Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree , 2015, AISTATS.

[22] Yi Li,et al. Deformable Convolutional Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[23] Ardhendu Behera,et al. A CNN Model for Head Pose Recognition using Wholes and Regions , 2019, 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019).

[24] Josephine Sullivan,et al. One millisecond face alignment with an ensemble of regression trees , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[25] Jie Chen,et al. Attention on Attention for Image Captioning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[26] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27] Chuang Gan,et al. Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering , 2019, AAAI.

[28] Tomás Pajdla,et al. NetVLAD: CNN Architecture for Weakly Supervised Place Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29] Michael S. Bernstein,et al. ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[30] Sergey Ioffe,et al. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[31] Larry S. Davis,et al. Model-based object pose in 25 lines of code , 1992, International Journal of Computer Vision.

[32] Michael Goesele,et al. Detail-Preserving Pooling in Deep Networks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33] Horst Bischof,et al. Annotated Facial Landmarks in the Wild: A large-scale, real-world database for facial landmark localization , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[34] Xiangyu Zhu,et al. Face Alignment in Full Pose Range: A 3D Total Solution , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35] Carlos D. Castillo,et al. An All-In-One Convolutional Neural Network for Face Analysis , 2016, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[36] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.

[37] Ruslan Salakhutdinov,et al. Action Recognition using Visual Attention , 2015, NIPS 2015.

[38] Jan Kautz,et al. Robust Model-Based 3D Head Pose Estimation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[39] Jörn Ostermann,et al. Deep Head Pose Estimation Using Synthetic Images and Partial Adversarial Domain Adaption for Continuous Label Spaces , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[40] Deva Ramanan,et al. Face detection, pose estimation, and landmark localization in the wild , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[41] Yu-Wing Tai,et al. Memory-Attended Recurrent Network for Video Captioning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42] Rama Chellappa,et al. HyperFace: A Deep Multi-Task Learning Framework for Face Detection, Landmark Localization, Pose Estimation, and Gender Recognition , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43] Georgios Tzimiropoulos,et al. How Far are We from Solving the 2D & 3D Face Alignment Problem? (and a Dataset of 230,000 3D Facial Landmarks) , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[44] Yu Cheng,et al. S3Pool: Pooling with Stochastic Spatial Sampling , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45] Deva Ramanan,et al. Attentional Pooling for Action Recognition , 2017, NIPS.

[46] Timothy F. Cootes,et al. Active Shape Models-Their Training and Application , 1995, Comput. Vis. Image Underst..

[47] Neil Martin Robertson,et al. Deep Head Pose: Gaze-Direction Estimation in Multimodal Video , 2015, IEEE Transactions on Multimedia.

[48] Rainer Stiefelhagen,et al. Real Time Head Model Creation and Head Pose Estimation on Consumer Depth Cameras , 2014, 2014 2nd International Conference on 3D Vision.

[49] Yoshua Bengio,et al. Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[50] Wenjun Zeng,et al. An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data , 2016, AAAI.

[51] Rama Chellappa,et al. KEPLER: Keypoint and Pose Estimation of Unconstrained Faces by Learning Efficient H-CNN Regressors , 2017, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[52] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[53] Sergey Ioffe,et al. Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54] Yung-Yu Chuang,et al. FSA-Net: Learning Fine-Grained Structure Aggregation for Head Pose Estimation From a Single Image , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[55] Vadim V. Strijov,et al. Position-Based Content Attention for Time Series Forecasting with Sequence-to-Sequence RNNs , 2017, ICONIP.

[56] Geoffrey E. Hinton,et al. Dynamic Routing Between Capsules , 2017, NIPS.