MOTRv3: Release-Fetch Supervision for End-to-End Multi-Object Tracking

Although end-to-end multi-object trackers like MOTR have the merit of simplicity, they suffer seriously from the conflict between detection and association, which results in unsatisfactory convergence dynamics. MOTRv2 partly addresses this problem, but it requires an additional detection network for assistance. In this work, we are the first to reveal that this conflict arises from unfair label assignment between detect queries and track queries during training, where detect queries recognize targets and track queries associate them. Based on this observation, we propose MOTRv3, which balances the label assignment process using a release-fetch supervision strategy. In this strategy, labels are first released for detection and then gradually fetched back for association. In addition, two further strategies, pseudo label distillation and track group denoising, are designed to improve the supervision for detection and association. Without the assistance of an extra detection network during inference, MOTRv3 achieves impressive performance across diverse benchmarks such as MOT17 and DanceTrack.
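
The release-fetch idea can be illustrated with a minimal sketch. The abstract gives no schedule or implementation details, so everything below is an assumption made for illustration: the linear ramp, the per-label random choice, and the names `fetch_ratio` and `split_labels` are hypothetical and are not taken from MOTRv3. The sketch only shows the general shape: early in training, labels of already-tracked objects are released to the detect queries, and as training proceeds an increasing share is fetched back to the track queries.

```python
import random


def fetch_ratio(step, total_steps):
    """Share of tracked-object labels held by track queries.

    Hypothetical linear ramp from 0 (all released) to 1 (all fetched back).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(1.0, max(0.0, step / total_steps))


def split_labels(tracked_ids, new_ids, step, total_steps, rng):
    """Split ground-truth object ids between track and detect queries.

    tracked_ids: ids of objects already followed by a track query.
    new_ids: ids of objects appearing for the first time in this frame.
    Returns (track_labels, detect_labels).
    """
    ratio = fetch_ratio(step, total_steps)
    track_labels, released = [], []
    for obj_id in tracked_ids:
        # Assumption: each label is fetched back independently with
        # probability `ratio`; otherwise it is released for detection.
        if rng.random() < ratio:
            track_labels.append(obj_id)
        else:
            released.append(obj_id)
    # Newborn objects always go to detect queries; released labels join them.
    detect_labels = list(new_ids) + released
    return track_labels, detect_labels


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 50, 100):
        print(step, split_labels([1, 2, 3, 4], [5], step, 100, rng))
```

At step 0 every label supervises detection, and at the final step the assignment matches the usual split in which track queries own all previously tracked objects and detect queries see only newborn ones.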
