Rotation Axis Focused Attention Network (RAFA-Net) for Estimating Head Pose

Head pose is a vital indicator of human attention and behavior. Therefore, automatic estimation of head pose from images is key to many applications. In this paper, we propose a novel approach for head pose estimation from a single RGB image. Many existing approaches often predict head poses by localizing facial landmarks and then solve 2D to 3D correspondence problem with a mean head model. Such approaches rely entirely on the landmark detection accuracy, an ad-hoc alignment step, and the extraneous head model. To address this drawback, we present an end-to-end deep network, which explores rotation axis (yaw, pitch and roll) focused innovative attention mechanism to capture the subtle changes in images. The mechanism uses attentional spatial pooling from a self-attention layer and learns the importance over fine-grained to coarse spatial structures and combine them to capture rich semantic information concerning a given rotation axis. The evaluation of our approach using three benchmark datasets is very competitive to state-of-the-arts, including with and without landmark-based methods. Code can be found at https://github.com/ArdhenduBehera/RAFA-Net.

[1]  Xiaoou Tang,et al.  Facial Landmark Detection by Deep Multi-task Learning , 2014, ECCV.

[2]  Stefanos Zafeiriou,et al.  300 Faces in-the-Wild Challenge: The First Facial Landmark Localization Challenge , 2013, 2013 IEEE International Conference on Computer Vision Workshops.

[3]  Stefanos Zafeiriou,et al.  A Comprehensive Performance Evaluation of Deformable Face Tracking “In-the-Wild” , 2016, International Journal of Computer Vision.

[4]  Han Zhang,et al.  Self-Attention Generative Adversarial Networks , 2018, ICML.

[5]  Jian Sun,et al.  Face Alignment by Explicit Shape Regression , 2012, International Journal of Computer Vision.

[6]  Jan Kautz,et al.  Dynamic Facial Analysis: From Bayesian Filtering to Recurrent Neural Network , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Cheng Cheng,et al.  A Deep Regression Architecture with Two-Stage Re-initialization for High Performance Facial Landmark Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Jean-Marc Odobez,et al.  Robust and Accurate 3D Head Pose Estimation through 3DMM and Online Head Model Reconstruction , 2017, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[9]  Harm de Vries,et al.  RMSProp and equilibrated adaptive learning rates for non-convex optimization. , 2015 .

[10]  Andrew Zisserman,et al.  Video Action Transformer Network , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Luc Van Gool,et al.  Random Forests for Real Time 3D Face Analysis , 2012, International Journal of Computer Vision.

[12]  Stephen Lin,et al.  Deformable ConvNets V2: More Deformable, Better Results , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Pi-Cheng Hsiu,et al.  SSR-Net: A Compact Soft Stagewise Regression Network for Age Estimation , 2018, IJCAI.

[14]  James M. Rehg,et al.  Fine-Grained Head Pose Estimation Without Keypoints , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[15]  Jian Sun,et al.  Face Alignment Via Component-Based Discriminative Search , 2008, ECCV.

[16]  Ramakant Nevatia,et al.  FacePoseNet: Making a Case for Landmark-Free Face Alignment , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[17]  Simao Herdade,et al.  Image Captioning: Transforming Objects into Words , 2019, NeurIPS.

[18]  Simon Baker,et al.  Active Appearance Models Revisited , 2004, International Journal of Computer Vision.

[19]  Takayuki Okatani,et al.  Improving Head Pose Estimation with a Combined Loss and Bounding Box Margin Adjustment , 2019, 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019).

[20]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[21]  Zhuowen Tu,et al.  Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree , 2015, AISTATS.

[22]  Yi Li,et al.  Deformable Convolutional Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[23]  Ardhendu Behera,et al.  A CNN Model for Head Pose Recognition using Wholes and Regions , 2019, 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019).

[24]  Josephine Sullivan,et al.  One millisecond face alignment with an ensemble of regression trees , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Jie Chen,et al.  Attention on Attention for Image Captioning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[26]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Chuang Gan,et al.  Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering , 2019, AAAI.

[28]  Tomás Pajdla,et al.  NetVLAD: CNN Architecture for Weakly Supervised Place Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[30]  Sergey Ioffe,et al.  Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[31]  Larry S. Davis,et al.  Model-based object pose in 25 lines of code , 1992, International Journal of Computer Vision.

[32]  Michael Goesele,et al.  Detail-Preserving Pooling in Deep Networks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33]  Horst Bischof,et al.  Annotated Facial Landmarks in the Wild: A large-scale, real-world database for facial landmark localization , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[34]  Xiangyu Zhu,et al.  Face Alignment in Full Pose Range: A 3D Total Solution , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Carlos D. Castillo,et al.  An All-In-One Convolutional Neural Network for Face Analysis , 2016, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[36]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[37]  Ruslan Salakhutdinov,et al.  Action Recognition using Visual Attention , 2015, NIPS 2015.

[38]  Jan Kautz,et al.  Robust Model-Based 3D Head Pose Estimation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[39]  Jörn Ostermann,et al.  Deep Head Pose Estimation Using Synthetic Images and Partial Adversarial Domain Adaption for Continuous Label Spaces , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[40]  Deva Ramanan,et al.  Face detection, pose estimation, and landmark localization in the wild , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[41]  Yu-Wing Tai,et al.  Memory-Attended Recurrent Network for Video Captioning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Rama Chellappa,et al.  HyperFace: A Deep Multi-Task Learning Framework for Face Detection, Landmark Localization, Pose Estimation, and Gender Recognition , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43]  Georgios Tzimiropoulos,et al.  How Far are We from Solving the 2D & 3D Face Alignment Problem? (and a Dataset of 230,000 3D Facial Landmarks) , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[44]  Yu Cheng,et al.  S3Pool: Pooling with Stochastic Spatial Sampling , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Deva Ramanan,et al.  Attentional Pooling for Action Recognition , 2017, NIPS.

[46]  Timothy F. Cootes,et al.  Active Shape Models-Their Training and Application , 1995, Comput. Vis. Image Underst..

[47]  Neil Martin Robertson,et al.  Deep Head Pose: Gaze-Direction Estimation in Multimodal Video , 2015, IEEE Transactions on Multimedia.

[48]  Rainer Stiefelhagen,et al.  Real Time Head Model Creation and Head Pose Estimation on Consumer Depth Cameras , 2014, 2014 2nd International Conference on 3D Vision.

[49]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[50]  Wenjun Zeng,et al.  An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data , 2016, AAAI.

[51]  Rama Chellappa,et al.  KEPLER: Keypoint and Pose Estimation of Unconstrained Faces by Learning Efficient H-CNN Regressors , 2017, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[52]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[53]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Yung-Yu Chuang,et al.  FSA-Net: Learning Fine-Grained Structure Aggregation for Head Pose Estimation From a Single Image , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Vadim V. Strijov,et al.  Position-Based Content Attention for Time Series Forecasting with Sequence-to-Sequence RNNs , 2017, ICONIP.

[56]  Geoffrey E. Hinton,et al.  Dynamic Routing Between Capsules , 2017, NIPS.