A fast and effective video vehicle detection method leveraging feature fusion and proposal temporal link

Vehicle detection in videos is a valuable but challenging technology in traffic monitoring. Due to the advantage of real-time detection, Single Shot MultiBox Detector (SSD) is often used to detect vehicles in images. However, the accuracy degradation caused by SSD is one of the significant problems in video vehicle detection. To address this problem in real time, this paper enhances the detection performance by improving the SSD and employing the relationship of inter-frame detections. We propose a feature-fused SSD detector and a Tracking-guided Detections Optimizing (TDO) strategy for fast and effective video vehicle detection. We introduce a lightweight feature fusion sub-network to the standard SSD network, which aggregate the deeper layer features into the shallower layer features to enhance the semantic information of the shallower layer features. At the post-processing stage of the feature-fused SSD, the non-maximum suppression (NMS) is replaced by the TDO strategy, which link vehicles of inter-frames by fast tracking algorithm. Thus the missed detections can be compensated by the propagated results, and the confidence of the final results can be optimized in the temporal. Our approach significantly improves the temporal consistency of the detection results with lower complexity computations. We evaluate the proposed method on two datasets. The experiments on our labeled highway dataset show that the mean average precision (mAP) of our method is 8.2% higher than that of the base detector. The runtime of our feature-fused SSD is 27.1 frames per second (fps), which is suitable for real-time detection. The experiments on the ImageNet VID dataset prove that the proposed method is comparable with the state-of-the-art detectors as well.

[1]  Luc Van Gool,et al.  The Pascal Visual Object Classes Challenge: A Retrospective , 2014, International Journal of Computer Vision.

[2]  Guangyi Xiao,et al.  Real-time chinese traffic warning signs recognition based on cascade and CNN , 2020, Journal of Real-Time Image Processing.

[3]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Zhidong Deng,et al.  Fully Motion-Aware Network for Video Object Detection , 2018, ECCV.

[5]  Xiaogang Wang,et al.  T-CNN: Tubelets With Convolutional Neural Networks for Object Detection From Videos , 2016, IEEE Transactions on Circuits and Systems for Video Technology.

[6]  Fei-Yue Wang,et al.  Data-Driven Intelligent Transportation Systems: A Survey , 2011, IEEE Transactions on Intelligent Transportation Systems.

[7]  Rui Caseiro,et al.  High-Speed Tracking with Kernelized Correlation Filters , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Fabio Tozeto Ramos,et al.  Simple online and realtime tracking , 2016, 2016 IEEE International Conference on Image Processing (ICIP).

[9]  Yi Li,et al.  R-FCN: Object Detection via Region-based Fully Convolutional Networks , 2016, NIPS.

[10]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Pheng-Ann Heng,et al.  SINet: A Scale-Insensitive Convolutional Neural Network for Fast Vehicle Detection , 2018, IEEE Transactions on Intelligent Transportation Systems.

[12]  Jun-Wei Hsieh,et al.  Symmetrical SURF and Its Applications to Vehicle Detection and Vehicle Make and Model Recognition , 2014, IEEE Transactions on Intelligent Transportation Systems.

[13]  Xiaogang Wang,et al.  Object Detection in Videos with Tubelet Proposal Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Enkhbayar Erdenee,et al.  Multi-class Multi-object Tracking Using Changing Point Detection , 2016, ECCV Workshops.

[15]  Xiaogang Wang,et al.  Object Detection from Video Tubelets with Convolutional Neural Networks , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  S. Govindarajulu,et al.  A Comparison of SIFT, PCA-SIFT and SURF , 2012 .

[17]  David G. Lowe,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004, International Journal of Computer Vision.

[18]  Yong-Jin Jeong,et al.  Fast background subtraction with adaptive block learning using expectation value suitable for real-time moving object detection , 2021, J. Real Time Image Process..

[19]  Jianbo Shi,et al.  Object Detection in Video with Spatiotemporal Sampling Networks , 2018, ECCV.

[20]  W. Eric L. Grimson,et al.  Adaptive background mixture models for real-time tracking , 1999, Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149).

[21]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Shiru Qu,et al.  Real-time vehicle detection and counting in complex traffic scenes using background subtraction model with low-rank decomposition , 2017 .

[23]  Geoffrey E. Hinton,et al.  Deep Learning , 2015, Nature.

[24]  Shuicheng Yan,et al.  Seq-NMS for Video Object Detection , 2016, ArXiv.

[25]  Yujie Wang,et al.  Flow-Guided Feature Aggregation for Video Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[26]  Yong Jae Lee,et al.  Video Object Detection with an Aligned Spatial-Temporal Memory , 2017, ECCV.

[27]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[28]  Yann LeCun,et al.  Convolutional Learning of Spatio-temporal Features , 2010, ECCV.

[29]  Zehang Sun,et al.  Monocular precrash vehicle detection: features and classifiers , 2006, IEEE Transactions on Image Processing.

[30]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[31]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Ali Farhadi,et al.  YOLOv3: An Incremental Improvement , 2018, ArXiv.

[33]  Yichen Wei,et al.  Deep Feature Flow for Video Recognition , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Stefania Perri,et al.  Multimodal background subtraction for high-performance embedded systems , 2019, Journal of Real-Time Image Processing.

[35]  Dit-Yan Yeung,et al.  Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting , 2015, NIPS.

[36]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[37]  Thomas Brox,et al.  FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[38]  Yichen Wei,et al.  Towards High Performance Video Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39]  Zezhi Chen,et al.  Vehicle detection, tracking and classification in urban traffic , 2012, 2012 15th International IEEE Conference on Intelligent Transportation Systems.

[40]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[42]  Andrew Zisserman,et al.  Detect to Track and Track to Detect , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[43]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.