ByteTrackV2: 2D and 3D Multi-Object Tracking by Associating Every Detection Box

Multi-object tracking (MOT) aims to estimate the bounding boxes and identities of objects across video frames. Detection boxes are the basis of both 2D and 3D MOT. Because detection scores inevitably fluctuate (for example, under occlusion), true objects with low scores are discarded and go missing from the tracking results. We propose a hierarchical data association strategy that recovers true objects from low-score detection boxes, alleviating missing objects and fragmented trajectories. This simple, generic strategy is effective in both 2D and 3D settings. In 3D scenarios, it is much easier for the tracker to predict object velocities in the world coordinate system. We therefore propose a complementary motion prediction strategy that combines the detected velocities with a Kalman filter to handle abrupt motion and short-term disappearance. ByteTrackV2 leads the nuScenes 3D MOT leaderboard in both the camera (56.4% AMOTA) and LiDAR (70.1% AMOTA) modalities. Furthermore, it is non-parametric and can be integrated with various detectors, which makes it appealing for real applications. The source code is available at https://github.com/ifzhang/ByteTrack-V2.
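The hierarchical association idea (match high-score boxes first, then try to match low-score boxes to the tracks that are still unmatched) can be illustrated with a short Python sketch. This is a minimal sketch under stated assumptions, not the released implementation: boxes are axis-aligned 2D `(x1, y1, x2, y2)`, similarity is IoU, the thresholds `high_thresh=0.6`, `low_thresh=0.1`, and `min_iou=0.2` are illustrative values, and greedy matching is used in place of an optimal assignment solver such as the Hungarian algorithm to keep the example dependency-free. All names (`Detection`, `iou`, `greedy_match`, `associate_every_box`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned


@dataclass
class Detection:
    box: Box
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(track_boxes: List[Box], det_boxes: List[Box], min_iou: float):
    """Greedily pair tracks and detections by descending IoU.

    Assumption: stands in for an optimal solver (e.g. the Hungarian algorithm).
    Returns (matches, unmatched_track_idx, unmatched_det_idx).
    """
    candidates = sorted(
        ((iou(tb, db), ti, di)
         for ti, tb in enumerate(track_boxes)
         for di, db in enumerate(det_boxes)),
        reverse=True,
    )
    used_t, used_d, matches = set(), set(), []
    for overlap, ti, di in candidates:
        if overlap < min_iou:
            break  # remaining candidates overlap even less
        if ti in used_t or di in used_d:
            continue
        used_t.add(ti)
        used_d.add(di)
        matches.append((ti, di))
    unmatched_t = [i for i in range(len(track_boxes)) if i not in used_t]
    unmatched_d = [i for i in range(len(det_boxes)) if i not in used_d]
    return matches, unmatched_t, unmatched_d


def associate_every_box(pred_boxes, detections, high_thresh=0.6,
                        low_thresh=0.1, min_iou=0.2):
    """Hierarchical association: high-score boxes first, then low-score boxes.

    pred_boxes are the tracks' predicted boxes for the current frame
    (assumed to come from a motion model such as a Kalman filter).
    Returns (matches, lost_track_idx, new_track_det_idx).
    """
    high = [i for i, d in enumerate(detections) if d.score >= high_thresh]
    low = [i for i, d in enumerate(detections)
           if low_thresh <= d.score < high_thresh]

    # Stage 1: every track against the high-score detections.
    m1, rest_t, rest_high = greedy_match(
        pred_boxes, [detections[i].box for i in high], min_iou)
    matches = [(t, high[d]) for t, d in m1]

    # Stage 2: tracks left over from stage 1 against the low-score detections.
    m2, still_unmatched, _ = greedy_match(
        [pred_boxes[t] for t in rest_t], [detections[i].box for i in low], min_iou)
    matches += [(rest_t[t], low[d]) for t, d in m2]
    lost = [rest_t[t] for t in still_unmatched]

    # In this sketch, unmatched low-score boxes are treated as background and
    # dropped; only unmatched high-score boxes become candidates for new tracks.
    new_tracks = [high[d] for d in rest_high]
    return sorted(matches), lost, new_tracks


# Usage: track 1 is partially occluded, so its detection scores only 0.3.
tracks = [(0, 0, 10, 10), (20, 0, 30, 10)]
dets = [
    Detection((1, 0, 11, 10), 0.9),    # clearly visible object
    Detection((21, 0, 31, 10), 0.3),   # occluded object, low score
    Detection((50, 50, 60, 60), 0.2),  # low-score background clutter
    Detection((80, 0, 90, 10), 0.8),   # newly appearing object
]
print(associate_every_box(tracks, dets))  # ([(0, 0), (1, 1)], [], [3])
```

In this toy example, a tracker that kept only high-score boxes would leave track 1 unmatched. The second stage recovers it from the 0.3-score box, while the isolated low-score box matches no track and is discarded rather than spawning a false track.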
