Track to Detect and Segment: An Online Multi-Object Tracker

Most online multi-object trackers perform object detection stand-alone in a neural net without any input from tracking. In this paper, we present a new online joint detection and tracking model, TraDeS (TRAck to DEtect and Segment), exploiting tracking clues to assist detection end-to-end. TraDeS infers object tracking offset by a cost volume, which is used to propagate previous object features for improving current object detection and segmentation. Effectiveness and superiority of TraDeS are shown on 4 datasets, including MOT (2D tracking), nuScenes (3D tracking), MOTS and Youtube-VIS (instance segmentation tracking). Project page: https://jialianwu.com/projects/TraDeS.html.

[1]  Wei Zhang,et al.  Segment as Points for Efficient Online Multi-Object Tracking and Segmentation , 2020, ECCV.

[2]  Stephen Lin,et al.  DPSNet: End-to-end Deep Plane Sweep Stereo , 2019, ICLR.

[3]  L. Leal-Taixé,et al.  TrackFormer: Multi-Object Tracking with Transformers , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Junliang Xing,et al.  Online Multi-Target Tracking with Tensor-Based High-Order Graph Matching , 2018, 2018 24th International Conference on Pattern Recognition (ICPR).

[5]  P. Luo,et al.  TransTrack: Multiple-Object Tracking with Transformer , 2020, ArXiv.

[6]  Jia Xu,et al.  Accurate Optical Flow via Direct Cost Volume Processing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Jianbo Shi,et al.  Learning Temporal Pose Estimation from Sparsely-Labeled Videos , 2019, NeurIPS.

[8]  Junjie Yan,et al.  Multiple Target Tracking Based on Undirected Hierarchical Relation Hypergraph , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  J. Serrat,et al.  Learning Multi-Object Tracking and Segmentation From Automatic Annotations , 2019, Computer Vision and Pattern Recognition.

[10]  Ming Yang,et al.  Temporal-Context Enhanced Detection of Heavily Occluded Pedestrians , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Rainer Stiefelhagen,et al.  Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics , 2008, EURASIP J. Image Video Process..

[12]  Yuchen Fan,et al.  Video Instance Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[13]  Trevor Darrell,et al.  Deep Layer Aggregation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[14]  Jiong Yang,et al.  PointPillars: Fast Encoders for Object Detection From Point Clouds , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Qing Zhao,et al.  Multi-Object Tracking Using Online Metric Learning with Long Short-Term Memory , 2018, 2018 25th IEEE International Conference on Image Processing (ICIP).

[16]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  David Held,et al.  3D Multi-Object Tracking: A Baseline and New Evaluation Metrics , 2019, 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[18]  Aggelos K. Katsaggelos,et al.  Efficient Video Object Segmentation via Network Modulation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[19]  Yi Li,et al.  Deformable Convolutional Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[20]  Davide Modolo,et al.  Combining Detection and Tracking for Human Pose Estimation in Videos , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Houqiang Li,et al.  Single Shot Video Object Detector , 2020, IEEE Transactions on Multimedia.

[22]  Laura Leal-Taixé,et al.  Tracking Without Bells and Whistles , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[23]  Songtao Liu,et al.  Learning Spatial Fusion for Single-Shot Object Detection , 2019, ArXiv.

[24]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Stephen Lin,et al.  Integrated Object Detection and Tracking with Tracklet-Conditioned Detection , 2018, ArXiv.

[26]  Jianbo Shi,et al.  Object Detection in Video with Spatiotemporal Sampling Networks , 2018, ECCV.

[27]  James M. Rehg,et al.  Multiple Hypothesis Tracking Revisited , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[28]  Xingyi Zhou,et al.  Objects as Points , 2019, ArXiv.

[29]  Vladlen Koltun,et al.  Tracking Objects as Points , 2020, ECCV.

[30]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[31]  L. Leal-Taix'e,et al.  Learning a Neural Solver for Multiple Object Tracking , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Ross B. Girshick,et al.  Focal Loss for Dense Object Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Bernt Schiele,et al.  Multiple People Tracking by Lifted Multicut and Person Re-identification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  J. Álvarez,et al.  Cost Volume Pyramid Based Depth Inference for Multi-View Stereo , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Bastian Leibe,et al.  FEELVOS: Fast End-To-End Embedding Learning for Video Object Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Hao Chen,et al.  Conditional Convolutions for Instance Segmentation , 2020, ECCV.

[37]  Silvio Savarese,et al.  Recurrent Autoregressive Networks for Online Multi-object Tracking , 2017, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[38]  Dietrich Paulus,et al.  Simple online and realtime tracking with a deep association metric , 2017, 2017 IEEE International Conference on Image Processing (ICIP).

[39]  Wongun Choi,et al.  Deep Network Flow for Multi-object Tracking , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Xinggang Wang,et al.  FairMOT: On the Fairness of Detection and Re-identification in Multiple Object Tracking , 2020, International Journal of Computer Vision.

[41]  Andrea Simonelli,et al.  Disentangling Monocular 3D Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[42]  Qiang Xu,et al.  nuScenes: A Multimodal Dataset for Autonomous Driving , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Mubarak Shah,et al.  Deep Affinity Network for Multiple Object Tracking , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44]  Ruigang Yang,et al.  A Unified Object Motion and Affinity Model for Online Multi-Object Tracking , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Shengjin Wang,et al.  Towards Real-Time Multi-Object Tracking , 2019, ECCV.

[46]  Fabio Tozeto Ramos,et al.  Simple online and realtime tracking , 2016, 2016 IEEE International Conference on Image Processing (ICIP).

[47]  Stefan Roth,et al.  MOT16: A Benchmark for Multi-Object Tracking , 2016, ArXiv.

[48]  Fahad Shahbaz Khan,et al.  SipMask: Spatial Information Preservation for Fast Image and Video Instance Segmentation , 2020, ECCV.

[49]  Kris Kitani,et al.  GNN3DMOT: Graph Neural Network for 3D Multi-Object Tracking With 2D-3D Multi-Feature Learning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Yu Liu,et al.  POI: Multiple Object Tracking with High Performance Detection and Appearance Feature , 2016, ECCV Workshops.

[51]  Feiyue Huang,et al.  Chained-Tracker: Chaining Paired Attentive Regression Results for End-to-End Joint Multiple-Object Detection and Tracking , 2020, ECCV.

[52]  Zhichao Lu,et al.  RetinaTrack: Online Single Stage Joint Detection and Tracking , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Cewu Lu,et al.  TubeTK: Adopting Tubes to Track Multi-Object in a One-Step Training Model , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Jan Kautz,et al.  PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[55]  Andreas Geiger,et al.  MOTS: Multi-Object Tracking and Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Kris Kitani,et al.  Joint Object Detection and Multi-Object Tracking with Graph Neural Networks , 2021, 2021 IEEE International Conference on Robotics and Automation (ICRA).

[57]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[58]  Jian Yang,et al.  Person Search via A Mask-Guided Two-Stream CNN Model , 2018, ECCV.

[59]  Vladlen Koltun,et al.  Full Flow: Optical Flow Estimation By Global Optimization over Regular Grids , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Robert T. Collins,et al.  A space-sweep approach to true multi-image matching , 1996, Proceedings CVPR IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[61]  Andrew Zisserman,et al.  Detect to Track and Track to Detect , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[62]  Xiaogang Wang,et al.  Joint Detection and Identification Feature Learning for Person Search , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[63]  Hua Yang,et al.  Online Multi-Object Tracking with Dual Matching Attention Networks , 2018, ECCV.

[64]  Enkhbayar Erdenee,et al.  Multi-class Multi-object Tracking Using Changing Point Detection , 2016, ECCV Workshops.