Speed-Up of Object Detection Neural Network with GPU

We accelerated an object detection neural network on a GPU. Specifically, we improved the detection speed of Faster R-CNN [1], one of the most commonly used detection networks [2]. The original Faster R-CNN implementation (py-faster-rcnn [3]) took 72.4 ms per image on our GPU server (OS: Ubuntu 14.04.5 LTS (GNU/Linux 4.2.0-42-generic x86_64); CPU: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz; GPU: Tesla P100-PCIE-16GB x 1; libraries: MKL, CUDA 8.0, cuDNN v5.1.5). We reduced the detection time by implementing new algorithms that are suited to GPUs, without sacrificing detection accuracy (mAP). Our GPU-accelerated Faster R-CNN detects objects in 55.8 ms per image, a speed-up of nearly 30%. Detection networks commonly build scored candidate regions, sort them, and apply non-maximum suppression (NMS); in Faster R-CNN, these steps are executed in the proposal layer. We reduced the processing time of the proposal layer from 5.6 ms to 2.2 ms, which makes it 2.5 times as fast as the original. We also evaluated detection speed with larger batch sizes. With a batch size of 16, detection time falls to 44.9 ms per image, 1.6 times as fast as the original Faster R-CNN (py-faster-rcnn). Because we accelerated basic methods that are common to detection networks, our methods are also applicable to other detection networks such as R-FCN [4], YOLO [5] [6] and SSD [7].
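For readers unfamiliar with the proposal-layer steps named above (scored candidate regions, sorting, non-maximum suppression), the following is a minimal CPU reference sketch in plain Python. It is not the authors' GPU algorithm, which the abstract does not describe; it only shows the standard greedy procedure that such a layer computes. The function names `iou` and `propose`, the parameter names, and the default values are illustrative assumptions, and box areas use continuous coordinates rather than any particular pixel convention.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def propose(boxes, scores, pre_nms_top_n=6000, post_nms_top_n=300,
            iou_threshold=0.7):
    """Return indices of the boxes kept after sorting and greedy NMS.

    All names and default values here are hypothetical, for illustration.
    """
    # Step 1: sort candidate indices by score, highest first, and truncate.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    order = order[:pre_nms_top_n]

    # Step 2: greedy NMS. Keep a box only if it does not overlap any
    # already-kept (higher-scoring) box by more than the threshold.
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if len(keep) == post_nms_top_n:
                break
    return keep
```

The inner loop depends on earlier decisions, which is why sorting and NMS are natural targets for GPU-specific algorithms: a naive port of this sequential loop does not parallelize well.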

[1] Yeongjae Cheon, et al. PVANet: Lightweight Deep Neural Networks for Real-time Object Detection, 2016, ArXiv.

[2] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.

[3] Ali Farhadi, et al. YOLO9000: Better, Faster, Stronger, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Yann LeCun, et al. Fast Training of Convolutional Networks through FFTs, 2013, ICLR.

[5] Pradeep Dubey, et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort, 2010, SIGMOD Conference.

[6] Norbert Luttenberger, et al. A Novel Sorting Algorithm for Many-core Architectures Based on Adaptive Bitonic Sort, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[7] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8] Jeff Johnson, et al. Fast Convolutional Nets With fbfft: A GPU Performance Evaluation, 2014, ICLR.

[9] Ali Farhadi, et al. You Only Look Once: Unified, Real-Time Object Detection, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10] Andrew Lavin, et al. Fast Algorithms for Convolutional Neural Networks, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11] Yi Li, et al. R-FCN: Object Detection via Region-based Fully Convolutional Networks, 2016, NIPS.

[12] Vitaly Osipov, et al. GPU sample sort, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[13] Andrew A. Davidson, et al. Efficient parallel merge sort for fixed and variable length keys, 2012, 2012 Innovative Parallel Computing (InPar).

[14] Michael Garland, et al. Designing efficient sorting algorithms for manycore GPUs, 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[15] S. Winograd. Arithmetic complexity of computations, 1980.

[16] Ross B. Girshick, et al. Fast R-CNN, 2015, arXiv:1504.08083.

[17] Hirotaka Tamura, et al. Fast algorithm using summed area tables with unified layer performing convolution and average pooling, 2017, 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP).

[18] Wei Liu, et al. SSD: Single Shot MultiBox Detector, 2015, ECCV.

[19] Sergio Guadarrama, et al. Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).