Joint Monocular 3D Vehicle Detection and Tracking

Vehicle 3D extents and trajectories are critical cues for predicting the future location of vehicles and planning future agent ego-motion based on those predictions. In this paper, we propose a novel online framework for 3D vehicle detection and tracking from monocular videos. The framework can not only associate detections of vehicles in motion over time, but also estimate their complete 3D bounding box information from a sequence of 2D images captured on a moving platform. Our method leverages 3D box depth-ordering matching for robust instance association and utilizes 3D trajectory prediction for re-identification of occluded vehicles. We also design a motion learning module based on an LSTM for more accurate long-term motion extrapolation. Our experiments on simulation, KITTI, and Argoverse datasets show that our 3D tracking pipeline offers robust data association and tracking. On Argoverse, our image-based method is significantly better for tracking 3D vehicles within 30 meters than the LiDAR-centric baseline methods.

[1]  Qiao Wang,et al.  VirtualWorlds as Proxy for Multi-object Tracking Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Trevor Darrell,et al.  BDD100K: A Diverse Driving Video Database with Scalable Annotation Tooling , 2018, ArXiv.

[4]  Dietrich Paulus,et al.  Simple online and realtime tracking with a deep association metric , 2017, 2017 IEEE International Conference on Image Processing (ICIP).

[5]  Zdenek Kalal,et al.  Tracking-Learning-Detection , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[7]  Germán Ros,et al.  CARLA: An Open Urban Driving Simulator , 2017, CoRL.

[8]  Qiang Xu,et al.  nuScenes: A Multimodal Dataset for Autonomous Driving , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Vladlen Koltun,et al.  Playing for Data: Ground Truth from Computer Games , 2016, ECCV.

[10]  Fabio Tozeto Ramos,et al.  Simple online and realtime tracking , 2016, 2016 IEEE International Conference on Image Processing (ICIP).

[11]  Yu Liu,et al.  POI: Multiple Object Tracking with High Performance Detection and Appearance Feature , 2016, ECCV Workshops.

[12]  Tankut Acarman,et al.  A Lightweight Online Multiple Object Vehicle Tracking Method , 2018, 2018 IEEE Intelligent Vehicles Symposium (IV).

[13]  Bruce A. Draper,et al.  Visual object tracking using adaptive correlation filters , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[14]  Silvio Savarese,et al.  Learning to Track at 100 FPS with Deep Regression Networks , 2016, ECCV.

[15]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[16]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Paul Newman,et al.  1 year, 1000 km: The Oxford RobotCar dataset , 2017, Int. J. Robotics Res..

[18]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[19]  Simone Calderara,et al.  Visual Tracking: An Experimental Survey , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Vibhav Vineet,et al.  Struck: Structured Output Tracking with Kernels , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Ming-Hsuan Yang,et al.  Online Multi-object Tracking via Structural Constraint Event Aggregation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Luca Bertinetto,et al.  Fully-Convolutional Siamese Networks for Object Tracking , 2016, ECCV Workshops.

[23]  Philipp Krähenbühl,et al.  Free Supervision from Video Games , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[24]  Trevor Darrell,et al.  Deep Layer Aggregation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[25]  Karl Granström,et al.  Mono-Camera 3D Multi-Object Tracking Using Deep Learning Detections and PMBM Filtering , 2018, 2018 IEEE Intelligent Vehicles Symposium (IV).

[26]  Kai Chen,et al.  MMDetection: Open MMLab Detection Toolbox and Benchmark , 2019, ArXiv.

[27]  Andrew Zisserman,et al.  Detect to Track and Track to Detect , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[28]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Silvio Savarese,et al.  Learning to Track: Online Multi-object Tracking by Decision Making , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[30]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Ali Farhadi,et al.  YOLO9000: Better, Faster, Stronger , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Bastian Leibe,et al.  Combined image- and world-space tracking in traffic scenes , 2017, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[33]  Rainer Stiefelhagen,et al.  Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics , 2008, EURASIP J. Image Video Process..

[34]  Ming-Hsuan Yang,et al.  Visual tracking with online Multiple Instance Learning , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Namil Kim,et al.  Multispectral pedestrian detection: Benchmark dataset and baseline , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Yu-Wing Tai,et al.  Accurate Single Stage Detector Using Recurrent Rolling Convolution , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Shaojie Shen,et al.  Stereo Vision-based Semantic 3D Object and Ego-motion Tracking for Autonomous Driving , 2018, ECCV.

[38]  Wongun Choi,et al.  Near-Online Multi-target Tracking with Aggregated Local Flow Descriptor , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[39]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  K. Madhava Krishna,et al.  Beyond Pixels: Leveraging Geometry and Shape Cues for Online Multi-Object Tracking , 2018, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[41]  Raquel Urtasun,et al.  End-to-end Learning of Multi-sensor 3D Tracking by Detection , 2018, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[42]  Michael Felsberg,et al.  The Visual Object Tracking VOT2015 Challenge Results , 2015, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW).

[43]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[44]  Arnold W. M. Smeulders,et al.  UvA-DARE (Digital Academic Repository) Siamese Instance Search for Tracking , 2016 .

[45]  Simon Lucey,et al.  Argoverse: 3D Tracking and Forecasting With Rich Maps , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Antonio M. López,et al.  The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Stefan Roth,et al.  MOT16: A Benchmark for Multi-Object Tracking , 2016, ArXiv.

[48]  Andrea Cavallaro,et al.  Accepted for Publication in Ieee Transactions on Image Processing Adaptive Appearance Modeling for Video Tracking: Survey and Evaluation , 2022 .

[49]  Ming-Hsuan Yang,et al.  UA-DETRAC: A new benchmark and protocol for multi-object detection and tracking , 2015, Comput. Vis. Image Underst..

[50]  Silvio Savarese,et al.  Tracking the Untrackable: Learning to Track Multiple Cues with Long-Term Dependencies , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[51]  Stefan Roth,et al.  MOTChallenge 2015: Towards a Benchmark for Multi-Target Tracking , 2015, ArXiv.

[52]  Jana Kosecka,et al.  3D Bounding Box Estimation Using Deep Learning and Geometry , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Andreas Geiger,et al.  FollowMe: Efficient Online Min-Cost Flow Tracking with Bounded Memory and Computation , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[54]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[55]  Vladlen Koltun,et al.  Playing for Benchmarks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).