A Delay Metric for Video Object Detection: What Average Precision Fails to Tell

Average precision (AP) is a widely used metric to evaluate detection accuracy of image and video object detectors. In this paper, we analyze the object detection from video and point out that mAP alone is not sufficient to capture the temporal nature of video object detection. To tackle this problem, we propose a comprehensive metric, Average Delay (AD), to measure and compare detection delay. To facilitate delay evaluation, we carefully select a subset of ImageNet VID, which we name as ImageNet VIDT with an emphasis on complex trajectories. By extensively evaluating a wide range of detectors on VIDT, we show that most methods drastically increase the detection delay but still preserve mAP well. In other words, mAP is not sensitive enough to reflect the temporal characteristics of a video object detector. Our results suggest that video object detection methods should be evaluated with a delay metric, particularly for latency-critical applications such as autonomous vehicle perception.

[1]  Pavlo Molchanov,et al.  Budget-Aware Activity Detection with A Recurrent Policy Network , 2017, BMVC.

[2]  Larry S. Davis,et al.  AVSS 2011 demo session: A large-scale benchmark dataset for event recognition in surveillance video , 2011, AVSS.

[3]  Stan Sclaroff,et al.  Learning Activity Progression in LSTMs for Activity Detection and Early Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Kai Chen,et al.  Optimizing Video Object Detection via a Scale-Time Lattice , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5]  Ganesh Sundaramoorthi,et al.  Minimum Delay Object Detection From Video , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[6]  Yujie Wang,et al.  Flow-Guided Feature Aggregation for Video Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[7]  Venkatesh Saligrama,et al.  Adaptive Neural Networks for Efficient Inference , 2017, ICML.

[8]  Lingjia Tang,et al.  The Architectural Implications of Autonomous Driving: Constraints and Acceleration , 2018, ASPLOS.

[9]  Junjie Yan,et al.  Impression Network for Video Object Detection , 2017, ArXiv.

[10]  Yi Li,et al.  R-FCN: Object Detection via Region-based Fully Convolutional Networks , 2016, NIPS.

[11]  Siddhartha S. Srinivasa,et al.  MOPED: A scalable and low latency object recognition and pose estimation system , 2010, 2010 IEEE International Conference on Robotics and Automation.

[12]  Kaiming He,et al.  Focal Loss for Dense Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[13]  Pavlo Molchanov,et al.  Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification , 2016, ACM Multimedia.

[14]  Pavlo Molchanov,et al.  Making Convolutional Networks Recurrent for Visual Sequence Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[15]  Menglong Zhu,et al.  Mobile Video Object Detection with Temporally-Aware Feature Maps , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[16]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[17]  Jianbo Shi,et al.  Object Detection in Video with Spatiotemporal Sampling Networks , 2018, ECCV.

[18]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Subutai Ahmad,et al.  Evaluating Real-Time Anomaly Detection Algorithms -- The Numenta Anomaly Benchmark , 2015, 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA).

[20]  Trevor Darrell,et al.  BDD100K: A Diverse Driving Video Database with Scalable Annotation Tooling , 2018, ArXiv.

[21]  Yichen Wei,et al.  Deep Feature Flow for Video Recognition , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Jan Kautz,et al.  STEP: Spatio-Temporal Progressive Learning for Video Action Detection , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Lars Petersson,et al.  Encouraging LSTMs to Anticipate Actions Very Early , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[24]  Bo Chen,et al.  MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications , 2017, ArXiv.

[25]  Xiangyu Zhang,et al.  ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[26]  Xiaogang Wang,et al.  T-CNN: Tubelets With Convolutional Neural Networks for Object Detection From Videos , 2016, IEEE Transactions on Circuits and Systems for Video Technology.

[27]  Xin Pan,et al.  YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Yun Fu,et al.  A Discriminative Model with Multiple Temporal Scales for Action Prediction , 2014, ECCV.

[29]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[30]  Yu Wang,et al.  Towards Real-Time Object Detection on Embedded Systems , 2018, IEEE Transactions on Emerging Topics in Computing.

[31]  William J. Dally,et al.  CaTDet: Cascaded Tracked Detector for Efficient Object Detection from Video , 2019, MLSys.

[32]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[33]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Ganesh Sundaramoorthi,et al.  Quickest Moving Object Detection , 2016, ArXiv.

[35]  Andrew Zisserman,et al.  Detect to Track and Track to Detect , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[36]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[37]  Hao Luo,et al.  Detect or Track: Towards Cost-Effective Video Object Detection/Tracking , 2018, AAAI.