3DMOTFormer: Graph Transformer for Online 3D Multi-Object Tracking

Tracking 3D objects accurately and consistently is crucial for autonomous vehicles, enabling more reliable downstream tasks such as trajectory prediction and motion planning. Building on the substantial progress in object detection in recent years, the tracking-by-detection paradigm has become a popular choice due to its simplicity and efficiency. State-of-the-art 3D multi-object tracking (MOT) approaches typically rely on non-learned, model-based algorithms such as the Kalman filter, which require many manually tuned parameters. Learning-based approaches, on the other hand, face the problem of adapting training to the online setting, which leads to an inevitable distribution mismatch between training and inference as well as suboptimal performance. In this work, we propose 3DMOTFormer, a learned, geometry-based 3D MOT framework built upon the transformer architecture. We use an Edge-Augmented Graph Transformer to reason on the track-detection bipartite graph frame by frame and conduct data association via edge classification. To reduce the distribution mismatch between training and inference, we propose a novel online training strategy with an autoregressive and recurrent forward pass as well as sequential batch optimization. Using CenterPoint detections, our approach achieves 71.2% and 68.2% AMOTA on the nuScenes validation and test splits, respectively. In addition, a trained 3DMOTFormer model generalizes well across different object detectors. Code is available at: https://github.com/dsx0511/3DMOTFormer.
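
To make the idea of data association via edge classification concrete, the following is a minimal sketch of one online step, not the released implementation. It assumes that a model has already produced a score in [0, 1] for each candidate track-detection edge; the function names (`associate`, `step`), the greedy one-to-one matching, and the 0.5 threshold are hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (track index, detection index)


def associate(edge_scores: Dict[Edge, float], threshold: float = 0.5) -> List[Edge]:
    """Greedy one-to-one matching on classified edges (illustrative assumption)."""
    matches: List[Edge] = []
    used_tracks, used_dets = set(), set()
    # Visit edges from the most to the least confident.
    for (t, d), score in sorted(edge_scores.items(), key=lambda kv: kv[1], reverse=True):
        if score < threshold:
            break  # all remaining edges are classified as "no match"
        if t in used_tracks or d in used_dets:
            continue  # each track and each detection is matched at most once
        matches.append((t, d))
        used_tracks.add(t)
        used_dets.add(d)
    return matches


def step(track_ids: List[int], num_dets: int, edge_scores: Dict[Edge, float],
         next_id: int, threshold: float = 0.5) -> Tuple[List[int], int]:
    """Assign a track ID to every detection of the current frame."""
    det_to_track = {d: track_ids[t] for t, d in associate(edge_scores, threshold)}
    det_ids: List[int] = []
    for d in range(num_dets):
        if d in det_to_track:
            det_ids.append(det_to_track[d])  # detection continues an existing track
        else:
            det_ids.append(next_id)  # unmatched detection starts a new track
            next_id += 1
    return det_ids, next_id


# Hypothetical edge scores for two tracks (IDs 10 and 11) and two detections.
scores = {(0, 0): 0.9, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.2}
det_ids, next_id = step([10, 11], 2, scores, next_id=12)
```

In an online setting, the IDs returned for one frame would form the track set for the next frame, so the output of each step feeds the next one in an autoregressive manner.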
