An End-to-End Task-Simplified and Anchor-Guided Deep Learning Framework for Image-Based Head Pose Estimation

Image-based head pose estimation (HPE) from an arbitrary view remains challenging because of complex imaging conditions and the intrinsic and extrinsic properties of faces. Unlike existing HPE methods that combine additional cues or tasks, this paper addresses HPE by reducing the complexity of the problem. Our method integrates a deep Task-Simplification oriented Image Regularization (TSIR) module with an Anchor-Guided Pose Estimation (AGPE) module and formulates HPE as a unified end-to-end learning framework. We define anchors as images that strictly obey the “gravity rule in camera”, that is, the assumption that the vertical axis of the camera coordinate system is always consistent with that of the local head coordinate system. An image pair consists of the regularized image produced by TSIR and its anchor counterpart; both are fed into the AGPE module to estimate fine-grained head poses. We also propose an Anchor-Guided Pairwise Loss (AGPL), which describes the interdependence of the poses within each image pair. Comprehensive experiments validate the method's effectiveness and show that it outperforms state-of-the-art image-based methods on both indoor and outdoor datasets.
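The abstract does not give the form of AGPL, so the following is only an illustrative sketch of what a pairwise pose loss over a (regularized image, anchor) pair could look like. It assumes poses are (yaw, pitch, roll) triples in degrees and combines a per-image mean absolute error with a term that compares the predicted pose difference within the pair against the ground-truth difference. The function names, the `weight` parameter, and the exact combination are hypothetical and are not taken from the paper.

```python
from typing import Sequence

Pose = Sequence[float]  # assumed layout: (yaw, pitch, roll) in degrees


def mae(a: Pose, b: Pose) -> float:
    """Mean absolute error between two pose vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def pairwise_pose_loss(
    pred_reg: Pose,
    pred_anchor: Pose,
    gt_reg: Pose,
    gt_anchor: Pose,
    weight: float = 1.0,  # hypothetical trade-off factor, not from the paper
) -> float:
    """Illustrative pairwise loss for one (regularized image, anchor) pair."""
    # Per-image terms: each prediction should match its own ground truth.
    individual = mae(pred_reg, gt_reg) + mae(pred_anchor, gt_anchor)

    # Pairwise term: the predicted pose offset between the two images
    # should match the ground-truth offset between them.
    pred_delta = [p - q for p, q in zip(pred_reg, pred_anchor)]
    gt_delta = [p - q for p, q in zip(gt_reg, gt_anchor)]
    relative = mae(pred_delta, gt_delta)

    return individual + weight * relative
```

In this sketch, a pair whose two predictions are wrong by the same offset is penalized only through the per-image terms, while a pair whose predictions disagree about the relative pose is penalized by both terms. This is one simple way to encode dependence between the poses of the two images.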
