Towards Real-Time Head Pose Estimation: Exploring Parameter-Reduced Residual Networks on In-the-wild Datasets

Head poses are a key component of human bodily communication and thus a decisive element of human-computer interaction. Real-time head pose estimation is crucial in contexts such as human-robot interaction and driver assistance systems. The most promising approaches to head pose estimation are based on Convolutional Neural Networks (CNNs). However, CNN models are often too complex to achieve real-time performance. To address this challenge, we explore a popular subgroup of CNNs, the Residual Networks (ResNets), and modify them to reduce their number of parameters. The ResNets are modified for different image sizes, including low-resolution images, and combined with varying numbers of layers. They are trained on in-the-wild datasets to ensure real-world applicability. We demonstrate that the performance of the ResNets can be maintained while reducing the number of parameters. The modified ResNets achieve state-of-the-art accuracy and provide inference fast enough for real-time use.
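To give a feel for how strongly the parameter count of a ResNet depends on its width and depth, the following is a minimal sketch that counts the parameters of the residual stages of a generic basic-block ResNet. Everything in it is an assumption made for illustration: the helper names (`conv_params`, `basic_block_params`, `body_params`), the ResNet-18-style layout of four stages with two blocks each, and the choice of halving the channel widths. It is not the specific modification evaluated in this work, and it ignores the stem and the output layer.

```python
# Illustrative parameter count for the residual stages of a basic-block ResNet.
# All names and the chosen widths are hypothetical, not taken from the paper.

def conv_params(c_in, c_out, k=3):
    """Weights of a k x k convolution without bias."""
    return k * k * c_in * c_out

def bn_params(c):
    """Learnable scale and shift of a batch normalization layer."""
    return 2 * c

def basic_block_params(c_in, c_out):
    """Two 3x3 conv + BN pairs, plus a 1x1 projection shortcut if widths differ."""
    n = conv_params(c_in, c_out) + bn_params(c_out)
    n += conv_params(c_out, c_out) + bn_params(c_out)
    if c_in != c_out:
        n += conv_params(c_in, c_out, k=1) + bn_params(c_out)
    return n

def body_params(widths, blocks_per_stage):
    """Sum over all stages; the stem is assumed to output widths[0] channels."""
    total, c_in = 0, widths[0]
    for c_out in widths:
        for _ in range(blocks_per_stage):
            total += basic_block_params(c_in, c_out)
            c_in = c_out
    return total

full = body_params((64, 128, 256, 512), blocks_per_stage=2)
slim = body_params((32, 64, 128, 256), blocks_per_stage=2)
shallow = body_params((64, 128, 256, 512), blocks_per_stage=1)

print(f"full width, 2 blocks/stage:   {full:,}")
print(f"half width, 2 blocks/stage:   {slim:,} ({slim / full:.1%} of full)")
print(f"full width, 1 block/stage:    {shallow:,} ({shallow / full:.1%} of full)")
```

Because convolution weights scale with the product of input and output channels, halving every width in this sketch removes roughly three quarters of the parameters, whereas removing blocks reduces the count only in proportion to the blocks dropped. Whether accuracy survives such a reduction is an empirical question that a count like this cannot answer.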
