Low-latency Block-wise Object Detection Method using SSD for High Resolution Video

In recent years, efficient deep-learning-based object detection methods that do not require window scanning, such as the Single Shot MultiBox Detector (SSD), have received significant attention in fields such as surveillance cameras and in-vehicle camera systems. However, these methods require a large amount of memory and computation. For this reason, when applying them to higher-definition video, it can be necessary to divide the video frames into multiple blocks for inference because of the limited memory capacity of GPUs or FPGAs. To avoid the accuracy degradation caused by block division, the conventional overlapping-block technique can be used. However, it increases the latency of object detection because more blocks must be processed. In this paper, we propose a low-latency block-wise object detection method that assigns a different block pattern to each frame, divides each frame according to its assigned pattern, and integrates the detection results of multiple frames. In the experiments, object detection accuracy and latency were evaluated on three sequences from the Multiple Object Tracking Benchmark 2017 dataset. When object movement is small, the method reduced the latency of human detection by about 40% while limiting accuracy degradation to between 0% and 2%.
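The abstract does not specify the block patterns or the rule used to integrate results, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes two non-overlapping grid patterns, one shifted by half a block, alternated frame by frame, and it assumes that detections from a short window of recent frames are merged with non-maximum suppression (NMS). The names `block_pattern`, `detect_video`, `detect_block`, and `window` are hypothetical, and `detect_block` stands in for an SSD call that returns boxes in block-local coordinates.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float, float]  # x1, y1, x2, y2, score
Block = Tuple[int, int, int, int]               # x1, y1, x2, y2 in frame coordinates


def cuts(length: int, block: int, offset: int) -> List[int]:
    """Cut positions along one axis for a grid shifted by `offset` pixels."""
    first = offset if offset > 0 else block
    return [0, *range(first, length, block), length]


def block_pattern(width: int, height: int, block: int, offset: int) -> List[Block]:
    """Non-overlapping blocks covering the frame (assumed pattern, not the paper's)."""
    xs, ys = cuts(width, block, offset), cuts(height, block, offset)
    return [(x1, y1, x2, y2)
            for y1, y2 in zip(ys, ys[1:])
            for x1, x2 in zip(xs, xs[1:])]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], thresh: float = 0.5) -> List[Box]:
    """Greedy NMS: keep the highest-scoring box among overlapping ones."""
    kept: List[Box] = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) < thresh for k in kept):
            kept.append(box)
    return kept


def detect_video(frames: Sequence[object],
                 patterns: Sequence[Sequence[Block]],
                 detect_block: Callable[[object, Block], List[Box]],
                 window: int = 2) -> List[List[Box]]:
    """Assign one pattern per frame, detect per block, merge recent frames."""
    history: List[List[Box]] = []
    results: List[List[Box]] = []
    for i, frame in enumerate(frames):
        pattern = patterns[i % len(patterns)]  # a different pattern for each frame
        dets: List[Box] = []
        for blk in pattern:
            ox, oy = blk[0], blk[1]
            # Map block-local boxes back to frame coordinates.
            for x1, y1, x2, y2, score in detect_block(frame, blk):
                dets.append((x1 + ox, y1 + oy, x2 + ox, y2 + oy, score))
        history = (history + [dets])[-window:]  # keep only the last `window` frames
        merged = [d for frame_dets in history for d in frame_dets]
        results.append(nms(merged))
    return results
```

In this sketch, an object that straddles a block boundary in one frame can fall inside a single block in the next frame, and the merged result carries that detection forward. Reusing detections from earlier frames is only reasonable when objects move little between frames, which matches the condition stated in the abstract.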
