Robust person head detection based on multi-scale representation fusion of deep convolution neural network

Person head detection remains challenging because of the large variability in head size and orientation, varying lighting conditions, and strong occlusions. Detecting small heads requires the local information contained in low-level layers rather than the semantic features of upper layers, yet most of these fine details are lost in the early convolutional layers of a deep convolutional neural network (DCNN). To improve overall detection accuracy, it is therefore important to bring local information from lower layers into the detection framework. In this letter, we use multi-scale representation fusion of a DCNN to combine lower layers with upper layers for detection. Our proposed model is based on the recent object detection network Single Shot MultiBox Detector (SSD), with VGG16 as the base network. Batch normalization (BN) layers are used in our proposed multi-task learning method to accelerate training and improve robustness. Compared with state-of-the-art methods, our proposed detector achieves superior person head detection performance on the HollywoodHeads dataset (81.0 AP) and the Casablanca dataset (78.5 AP).
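
The general idea of fusing a fine, low-level feature map with a coarse, high-level one can be illustrated with a minimal sketch. The abstract does not specify the fusion operator, so the choices here (nearest-neighbor upsampling followed by channel concatenation) are assumptions made only for illustration, and the names `upsample_nearest` and `fuse` are hypothetical. Feature maps are plain nested lists indexed as [channel][row][column] so that the example runs without any deep learning library.

```python
from typing import List

# A feature map is indexed as [channel][row][col].
FeatureMap = List[List[List[float]]]


def upsample_nearest(fmap: FeatureMap, factor: int) -> FeatureMap:
    """Repeat every value `factor` times along both height and width."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(factor)]
            rows.extend(list(wide) for _ in range(factor))
        out.append(rows)
    return out


def fuse(fine: FeatureMap, coarse: FeatureMap) -> FeatureMap:
    """Bring `coarse` to the resolution of `fine`, then stack the channels.

    Assumption: concatenation is used as the fusion step; the method in the
    letter may use a different operator.
    """
    fine_h, fine_w = len(fine[0]), len(fine[0][0])
    coarse_h, coarse_w = len(coarse[0]), len(coarse[0][0])
    if fine_h % coarse_h or fine_w % coarse_w:
        raise ValueError("fine size must be a multiple of coarse size")
    factor = fine_h // coarse_h
    if fine_w // coarse_w != factor:
        raise ValueError("height and width scale factors must match")
    return fine + upsample_nearest(coarse, factor)


# One low-level channel at 4x4 and two high-level channels at 2x2.
fine = [[[0.0] * 4 for _ in range(4)]]
coarse = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
fused = fuse(fine, coarse)
print(len(fused), len(fused[0]), len(fused[0][0]))
```

In this sketch the fused map keeps the spatial resolution of the lower layer while carrying the channels of both layers, which is the property that lets a detection head see fine detail and semantic context at the same location.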
