TransCenter: Transformers with Dense Queries for Multiple-Object Tracking

Transformer networks have proven extremely powerful for a wide variety of tasks since their introduction. Computer vision is no exception: transformers have become very popular in the vision community in recent years. Despite this wave, multiple-object tracking (MOT) has so far appeared poorly compatible with transformers. We argue that the standard representation, bounding boxes, is not well suited to learning transformers for MOT. Inspired by recent research, we propose TransCenter, the first transformer-based architecture for tracking the centers of multiple targets. Methodologically, we propose the use of dense queries in a double-decoder network to robustly infer the heatmap of the targets' centers and associate them through time. TransCenter outperforms the current state of the art in multiple-object tracking on both MOT17 and MOT20. Our ablation study demonstrates the advantage of the proposed architecture over more naive alternatives. The code will be made publicly available.
