A Light CNN based Method for Hand Detection and Orientation Estimation

Hand detection is an essential step for many tasks, including HCI applications. However, robustly detecting hands of varied appearance against cluttered backgrounds, under motion blur, or in changing light remains a challenging problem. Recently, object detection methods based on CNN models have significantly improved the accuracy of hand detection, but at a high computational cost. In this paper, we propose a lightweight CNN that uses a modified MobileNet as the feature extractor together with the SSD framework to detect hand location and orientation robustly and quickly. The network generates a set of feature maps at various resolutions to detect hands of different sizes. To improve robustness, we also employ a top-down feature fusion architecture that integrates context information across feature levels. To estimate hand orientation accurately with a CNN, we estimate the projections of two orthogonal vectors onto the horizontal and vertical axes, then recover the size and orientation of a bounding box that exactly encloses the hand. Evaluated on the challenging Oxford hand dataset, our method reaches 83.2% average precision (AP) at 139 FPS on an Nvidia Titan X, outperforming previous methods in both accuracy and efficiency.
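
The recovery of a rotated box from two orthogonal vectors can be illustrated with a short sketch. The abstract does not give the exact parameterization, so the following is an assumption rather than the paper's implementation: each vector is taken to run from the box center to the midpoint of one side, and the function name `recover_box` and its arguments are hypothetical.

```python
import math


def recover_box(cx, cy, v1, v2):
    """Recover a rotated box from two orthogonal half-axis vectors.

    Assumption (not from the paper): v1 and v2 are (dx, dy) pairs, i.e. the
    horizontal and vertical projections of vectors running from the box
    center (cx, cy) to the midpoints of two adjacent sides.
    """
    # Each side length is twice the length of its half-axis vector.
    width = 2.0 * math.hypot(v1[0], v1[1])
    height = 2.0 * math.hypot(v2[0], v2[1])

    # Orientation is the angle of the first vector, in radians.
    angle = math.atan2(v1[1], v1[0])

    # Corners are the center plus or minus each half-axis vector.
    signs = ((1, 1), (1, -1), (-1, -1), (-1, 1))
    corners = [
        (cx + s1 * v1[0] + s2 * v2[0], cy + s1 * v1[1] + s2 * v2[1])
        for s1, s2 in signs
    ]
    return width, height, angle, corners


# Example: a 10 x 5 box centered at the origin, rotated by atan2(4, 3).
width, height, angle, corners = recover_box(0.0, 0.0, (3.0, 4.0), (-2.0, 1.5))
```

Regressing vector projections instead of an angle directly avoids the discontinuity where an angle wraps around, which is one plausible motivation for this kind of parameterization.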
