Pack and Detect: Fast Object Detection in Videos Using Region-of-Interest Packing

Object detection in videos is an important task in computer vision for various applications such as object tracking, video summarization and video search. Although great progress has been made in improving the accuracy of object detection in recent years due to the rise of deep neural networks, the state-of-the-art algorithms are highly computationally intensive. In order to address this challenge, we make two important observations in the context of videos: (i) Objects often occupy only a small fraction of the area in each video frame, and (ii) There is a high likelihood of strong temporal correlation between consecutive frames. Based on these observations, we propose Pack and Detect (PaD), an approach to reduce the computational requirements of object detection in videos. In PaD, only selected video frames called anchor frames are processed at full size. In the frames that lie between anchor frames (inter-anchor frames), regions of interest (ROIs) are identified based on the detections in the previous frame. We propose an algorithm to pack the ROIs of each inter-anchor frame together into a reduced-size frame. The computational requirements of the detector are reduced due to the lower size of the input. In order to maintain the accuracy of object detection, the proposed algorithm expands the ROIs greedily to provide additional background around each object to the detector. PaD can use any underlying neural network architecture to process the full-size and reduced-size frames. Experiments using the ImageNet video object detection dataset indicate that PaD can potentially reduce the number of FLOPS required for a frame by 4×. This leads to an overall increase in throughput of 1.25× on a 2.1 GHz Intel Xeon server with a NVIDIA Titan X GPU at the cost of 1.1% drop in accuracy.

[1]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Jianxin Wu,et al.  ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[3]  Song Han,et al.  Learning both Weights and Connections for Efficient Neural Network , 2015, NIPS.

[4]  Ali Farhadi,et al.  YOLO9000: Better, Faster, Stronger , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Junjie Yan,et al.  Impression Network for Video Object Detection , 2017, ArXiv.

[6]  Hanan Samet,et al.  Pruning Filters for Efficient ConvNets , 2016, ICLR.

[7]  Yi Li,et al.  R-FCN: Object Detection via Region-based Fully Convolutional Networks , 2016, NIPS.

[8]  Geoffrey E. Hinton,et al.  Learning to combine foveal glimpses with a third-order Boltzmann machine , 2010, NIPS.

[9]  Yujie Wang,et al.  Flow-Guided Feature Aggregation for Video Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[10]  Christopher Joseph Pal,et al.  RATM: Recurrent Attentive Tracking Model , 2015, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[11]  Bo Chen,et al.  MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications , 2017, ArXiv.

[12]  Forrest N. Iandola,et al.  SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size , 2016, ArXiv.

[13]  Sungroh Yoon,et al.  Big/little deep neural network for ultra low power inference , 2015, 2015 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS).

[14]  Konrad Schindler,et al.  Online Multi-Target Tracking Using Recurrent Neural Networks , 2016, AAAI.

[15]  Xiangyu Zhang,et al.  Channel Pruning for Accelerating Very Deep Neural Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[16]  Silvio Savarese,et al.  Learning to Track: Online Multi-object Tracking by Decision Making , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  Xiaogang Wang,et al.  T-CNN: Tubelets With Convolutional Neural Networks for Object Detection From Videos , 2016, IEEE Transactions on Circuits and Systems for Video Technology.

[18]  Balaraman Ravindran,et al.  Recovering from Random Pruning: On the Plasticity of Deep Convolutional Neural Networks , 2018, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[19]  Jie Liu,et al.  Scalable-effort classifiers for energy-efficient machine learning , 2015, DAC.

[20]  Brian D. Ehret Learning where to look , 1999, CHI EA '99.

[21]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Alexander Wong,et al.  Fast YOLO: A Fast You Only Look Once System for Real-time Embedded Object Detection in Video , 2017, ArXiv.

[23]  Song Han,et al.  Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding , 2015, ICLR.

[24]  Shuicheng Yan,et al.  Seq-NMS for Video Object Detection , 2016, ArXiv.

[25]  Wenjun Zeng,et al.  Object Detection in Videos by Short and Long Range Object Linking , 2018, ArXiv.

[26]  Koray Kavukcuoglu,et al.  Visual Attention , 2020, Computational Models for Cognitive Vision.

[27]  Andrew Zisserman,et al.  Detect to Track and Track to Detect , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[28]  Junzhi Yu,et al.  TSSD: Temporal Single-Shot Object Detection Based on Attention-Aware LSTM , 2018, ArXiv.

[29]  Jia Deng,et al.  Dynamic Deep Neural Networks: Optimizing Accuracy-Efficiency Trade-offs by Selective Execution , 2017, AAAI.

[30]  Bolei Zhou,et al.  Recurrent Residual Module for Fast Inference in Videos , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31]  Yiran Chen,et al.  Learning Structured Sparsity in Deep Neural Networks , 2016, NIPS.

[32]  Wonyong Sung,et al.  Fixed point optimization of deep convolutional neural networks for object recognition , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[33]  Stefan Roth,et al.  MOT16: A Benchmark for Multi-Object Tracking , 2016, ArXiv.

[34]  Silvio Savarese,et al.  Tracking the Untrackable: Learning to Track Multiple Cues with Long-Term Dependencies , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[35]  Xiaogang Wang,et al.  Object Detection from Video Tubelets with Convolutional Neural Networks , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Yichen Wei,et al.  Towards High Performance Video Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[37]  Ali Farhadi,et al.  YOLOv3: An Incremental Improvement , 2018, ArXiv.

[38]  Yichen Wei,et al.  Deep Feature Flow for Video Recognition , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[40]  Anand Raghunathan,et al.  DyVEDeep: Dynamic Variable Effort Deep Neural Networks , 2017, ArXiv.

[41]  Kaushik Roy,et al.  AxNN: Energy-efficient neuromorphic systems using approximate computing , 2014, 2014 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED).

[42]  Xiaogang Wang,et al.  Object Detection in Videos with Tubelet Proposal Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Alex Graves,et al.  Recurrent Models of Visual Attention , 2014, NIPS.

[44]  Vivienne Sze,et al.  Efficient Processing of Deep Neural Networks: A Tutorial and Survey , 2017, Proceedings of the IEEE.

[45]  Song Han,et al.  EIE: Efficient Inference Engine on Compressed Deep Neural Network , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[46]  Menglong Zhu,et al.  Mobile Video Object Detection with Temporally-Aware Feature Maps , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[47]  Miodrag Potkonjak,et al.  Pruning Filters and Classes: Towards On-Device Customization of Convolutional Neural Networks , 2017, EMDL '17.

[48]  Konrad Schindler,et al.  Learning by Tracking: Siamese CNN for Robust Target Association , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[49]  Joelle Pineau,et al.  Conditional Computation in Neural Networks for faster models , 2015, ArXiv.

[50]  Alex Bewley,et al.  Hierarchical Attentive Recurrent Tracking , 2017, NIPS.