ThunderNet: Towards Real-Time Generic Object Detection on Mobile Devices

Real-time generic object detection on mobile platforms is a crucial but challenging computer vision task. However, previous CNN-based detectors suffer from enormous computational cost, which hinders them from real-time inference in computation-constrained scenarios. In this paper, we investigate the effectiveness of two-stage detectors in real-time generic detection and propose a lightweight two-stage detector named ThunderNet. In the backbone part, we analyze the drawbacks in previous lightweight backbones and present a lightweight backbone designed for object detection. In the detection part, we exploit an extremely efficient RPN and detection head design. To generate more discriminative feature representation, we design two efficient architecture blocks, Context Enhancement Module and Spatial Attention Module. At last, we investigate the balance between the input resolution, the backbone, and the detection head. Compared with lightweight one-stage detectors, ThunderNet achieves superior performance with only 40% of the computational cost on PASCAL VOC and COCO benchmarks. Without bells and whistles, our model runs at 24.1 fps on an ARM-based device. To the best of our knowledge, this is the first real-time detector reported on ARM platforms. Code will be released for paper reproduction.

[1]  Weiyao Lin,et al.  Tiny-DSOD: Lightweight Object Detection for Resource-Restricted Usages , 2018, BMVC.

[2]  Wei Liu,et al.  DSSD : Deconvolutional Single Shot Detector , 2017, ArXiv.

[3]  Ali Farhadi,et al.  YOLOv3: An Incremental Improvement , 2018, ArXiv.

[4]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Larry S. Davis,et al.  Soft-NMS — Improving Object Detection with One Line of Code , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[6]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Enhua Wu,et al.  Squeeze-and-Excitation Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Charles X. Ling,et al.  Pelee: A Real-Time Object Detection System on Mobile Devices , 2018, NeurIPS.

[9]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[10]  Ali Farhadi,et al.  YOLO9000: Better, Faster, Stronger , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[12]  Xiangyu Zhang,et al.  DetNet: Design Backbone for Object Detection , 2018, ECCV.

[13]  Xiangyu Zhang,et al.  ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[14]  Jian Sun,et al.  Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Nuno Vasconcelos,et al.  Cascade R-CNN: Delving Into High Quality Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[16]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  François Chollet,et al.  Xception: Deep Learning with Depthwise Separable Convolutions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Bo Chen,et al.  MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications , 2017, ArXiv.

[19]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[20]  Xiangyu Zhang,et al.  ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design , 2018, ECCV.

[21]  Yuning Jiang,et al.  MegDet: A Large Mini-Batch Object Detector , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[22]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[24]  Mark Sandler,et al.  MobileNetV2: Inverted Residuals and Linear Bottlenecks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[25]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Abhinav Gupta,et al.  Training Region-Based Object Detectors with Online Hard Example Mining , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Rakesh Mehta,et al.  Object detection at 200 Frames Per Second , 2018, ECCV Workshops.

[28]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Yi Li,et al.  R-FCN: Object Detection via Region-based Fully Convolutional Networks , 2016, NIPS.

[30]  Xiangyu Zhang,et al.  Large Kernel Matters — Improve Semantic Segmentation by Global Convolutional Network , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Ross B. Girshick,et al.  Focal Loss for Dense Object Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[32]  Xiangyu Zhang,et al.  Light-Head R-CNN: In Defense of Two-Stage Object Detector , 2017, ArXiv.

[33]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.