MOS: A Low Latency and Lightweight Framework for Face Detection, Landmark Localization, and Head Pose Estimation

With the emergence of service robots and surveillance cameras, dynamic face recognition (DFR) in wild has received much attention in recent years. Face detection and head pose estimation are two important steps for DFR. Very often, the pose is estimated after the face detection. However, such sequential computations lead to higher latency. In this paper, we propose a low latency and lightweight network for simultaneous face detection, landmark localization and head pose estimation. Inspired by the observation that it is more challenging to locate the facial landmarks for faces with large angles, a pose loss is proposed to constrain the learning. Moreover, we also propose an uncertainty multi-task loss to learn the weights of individual tasks automatically. Another challenge is that robots often use low computational units like ARM based computing core and we often need to use lightweight networks instead of the heavy ones, which lead to performance drop especially for small and hard faces. In this paper, we propose online feedback sampling to augment the training samples across different scales, which increases the diversity of training data automatically. Through validation in commonly used WIDER FACE, AFLW and AFLW2000 datasets, the results show that the proposed method achieves the state-of-the-art performance in low computational resources.

[1]  Yu Qiao,et al.  Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks , 2016, IEEE Signal Processing Letters.

[2]  Shiguang Shan,et al.  Mean-Variance Loss for Deep Age Estimation from a Face , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[3]  J. Gregson,et al.  WHENet: Real-time Fine-Grained Estimation for Wide Range Head Pose , 2020, BMVC.

[4]  Erik Learned-Miller,et al.  FDDB: A benchmark for face detection in unconstrained settings , 2010 .

[5]  Xiangyu Zhu,et al.  Face Alignment in Full Pose Range: A 3D Total Solution , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Tal Hassner,et al.  img2pose: Face Alignment and Detection via 6DoF, Face Pose Estimation , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Josephine Sullivan,et al.  One millisecond face alignment with an ensemble of regression trees , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Changsheng Xu,et al.  Joint Pose and Expression Modeling for Facial Expression Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[9]  Ran He,et al.  PyramidBox++: High Performance Detector for Finding Tiny Face , 2019, ArXiv.

[10]  Paul A. Viola,et al.  Robust Real-Time Face Detection , 2001, International Journal of Computer Vision.

[11]  Sheng Wan,et al.  QuatNet: Quaternion-Based Head Pose Estimation With Multiregression Loss , 2019, IEEE Transactions on Multimedia.

[12]  Xiangyu Zhang,et al.  ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design , 2018, ECCV.

[13]  Bhiksha Raj,et al.  SphereFace: Deep Hypersphere Embedding for Face Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Abhinav Gupta,et al.  Training Region-Based Object Detectors with Online Hard Example Mining , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[16]  Bo Chen,et al.  MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications , 2017, ArXiv.

[17]  Stephen Lin,et al.  Deformable ConvNets V2: More Deformable, Better Results , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Roberto Cipolla,et al.  SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Rama Chellappa,et al.  HyperFace: A Deep Multi-Task Learning Framework for Face Detection, Landmark Localization, Pose Estimation, and Gender Recognition , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Xu Tang,et al.  PyramidBox: A Context-assisted Single Shot Face Detector , 2018, ECCV.

[22]  Xiangyu Zhang,et al.  Bounding Box Regression With Uncertainty for Accurate Object Detection , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Xiaoou Tang,et al.  Facial Landmark Detection by Deep Multi-task Learning , 2014, ECCV.

[24]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Debing Zhang,et al.  Lightweight Face Recognition Challenge , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[26]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[27]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Wangkit Wong,et al.  RankPose: Learning Generalised Feature with Rank Supervision for Head Pose Estimation , 2020, BMVC.

[29]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Robert C. Bolles,et al.  Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography , 1981, CACM.

[31]  Shifeng Zhang,et al.  FaceBoxes: A CPU real-time face detector with high accuracy , 2017, 2017 IEEE International Joint Conference on Biometrics (IJCB).

[32]  Yung-Yu Chuang,et al.  FSA-Net: Learning Fine-Grained Structure Aggregation for Head Pose Estimation From a Single Image , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Rongrong Ji,et al.  ASFD: Automatic and Scalable Face Detector , 2020, ACM Multimedia.

[35]  Stan Z. Li,et al.  FLDet: A CPU Real-time Joint Face and Landmark Detector , 2019, 2019 International Conference on Biometrics (ICB).

[36]  Irene Kotsia,et al.  RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Larry S. Davis,et al.  SSH: Single Stage Headless Face Detector , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[38]  Rama Chellappa,et al.  KEPLER: Keypoint and Pose Estimation of Unconstrained Faces by Learning Efficient H-CNN Regressors , 2017, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[39]  Mark Sandler,et al.  MobileNetV2: Inverted Residuals and Linear Bottlenecks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40]  James M. Rehg,et al.  Fine-Grained Head Pose Estimation Without Keypoints , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[41]  Shifeng Zhang,et al.  Single-Shot Refinement Neural Network for Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[42]  Shuo Yang,et al.  WIDER FACE: A Face Detection Benchmark , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Roberto Cipolla,et al.  Multi-task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[44]  Georgios Tzimiropoulos,et al.  How Far are We from Solving the 2D & 3D Face Alignment Problem? (and a Dataset of 230,000 3D Facial Landmarks) , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[45]  Martial Hebert,et al.  Cross-Stitch Networks for Multi-task Learning , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Shiming Xiang,et al.  LFFD: A Light and Fast Face Detector for Edge Devices , 2019, ArXiv.

[47]  Lu Tian,et al.  ProgressFace: Scale-Aware Progressive Learning for Face Detection , 2020, ECCV.

[48]  Shifeng Zhang,et al.  S^3FD: Single Shot Scale-Invariant Face Detector , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[49]  Yang Liu,et al.  HAMBox: Delving Into Mining High-Quality Anchors on Face Detection , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Shifeng Zhang,et al.  Selective Refinement Network for High Performance Face Detection , 2018, AAAI.

[51]  Andrew J. Davison,et al.  End-To-End Multi-Task Learning With Attention , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Stefanos Zafeiriou,et al.  ArcFace: Additive Angular Margin Loss for Deep Face Recognition , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Horst Bischof,et al.  Annotated Facial Landmarks in the Wild: A large-scale, real-world database for facial landmark localization , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).