How to Train Your Deep Multi-Object Tracker

The recent trend in vision-based multi-object tracking (MOT) is heading towards leveraging the representational power of deep learning to jointly learn to detect and track objects. However, existing methods train only certain sub-modules using loss functions that often do not correlate with established tracking evaluation measures such as Multi-Object Tracking Accuracy (MOTA) and Precision (MOTP). As these measures are not differentiable, the choice of appropriate loss functions for end-to-end training of multi-object tracking methods is still an open research problem. In this paper, we bridge this gap by proposing a differentiable proxy of MOTA and MOTP, which we combine in a loss function suitable for end-to-end training of deep multi-object trackers. As a key ingredient, we propose a Deep Hungarian Net (DHN) module that approximates the Hungarian matching algorithm. DHN allows estimating the correspondence between object tracks and ground truth objects to compute differentiable proxies of MOTA and MOTP, which are in turn used to optimize deep trackers directly. We experimentally demonstrate that the proposed differentiable framework improves the performance of existing multi-object trackers, and we establish a new state of the art on the MOTChallenge benchmark. Our code is publicly available from https://github.com/yihongXU/deepMOT.

[1]  Wongun Choi,et al.  Near-Online Multi-target Tracking with Aggregated Local Flow Descriptor , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[2]  James M. Rehg,et al.  Multiple Hypothesis Tracking Revisited , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[3]  Dhruv Batra,et al.  Joint Unsupervised Learning of Deep Representations and Image Clusters , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[5]  David D. Cox,et al.  Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures , 2013, ICML.

[6]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Chuang Gan,et al.  Self-Supervised Moving Vehicle Tracking With Stereo Sound , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[8]  Long Chen,et al.  Online multi-object tracking with convolutional neural networks , 2017, 2017 IEEE International Conference on Image Processing (ICIP).

[9]  Volker Eiselein,et al.  Sequential sensor fusion combining probability hypothesis density and kernelized correlation filters for multi-object tracking in video data , 2017, 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS).

[10]  Hua Yang,et al.  Online Multi-Object Tracking with Dual Matching Attention Networks , 2018, ECCV.

[11]  Ben Taskar,et al.  Max-Margin Markov Networks , 2003, NIPS.

[12]  Luc Van Gool,et al.  Coupled Object Detection and Tracking from Static Cameras and Moving Vehicles , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Charless C. Fowlkes,et al.  Learning Optimal Parameters for Multi-target Tracking with Contextual Interactions , 2016, International Journal of Computer Vision.

[14]  Luc Van Gool,et al.  Customized Multi-person Tracker , 2018, ACCV.

[15]  Konrad Schindler,et al.  Online Multi-Target Tracking Using Recurrent Neural Networks , 2016, AAAI.

[16]  Silvio Savarese,et al.  Learning to Track: Online Multi-object Tracking by Decision Making , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  Yann LeCun,et al.  Dimensionality Reduction by Learning an Invariant Mapping , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[18]  Rainer Stiefelhagen,et al.  Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics , 2008, EURASIP J. Image Video Process..

[19]  Volker Eiselein,et al.  High-Speed tracking-by-detection without using image information , 2017, 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS).

[20]  Bodo Rosenhahn,et al.  Improvements to Frank-Wolfe optimization for multi-detector multi-object tracking , 2017, ArXiv.

[21]  Ian D. Reid,et al.  Data-Driven Approximations to NP-Hard Problems , 2017, AAAI.

[22]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Yang Zhang,et al.  Iterative Multiple Hypothesis Tracking With Tracklet-Level Association , 2019, IEEE Transactions on Circuits and Systems for Video Technology.

[24]  Bohyung Han,et al.  Multi-object Tracking with Quadruplet Convolutional Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[26]  Mohamed R. Amer,et al.  Multiobject tracking as maximum weight independent set , 2011, CVPR 2011.

[27]  Yang Zhang,et al.  Heterogeneous Association Graph Fusion for Target Association in Multiple Object Tracking , 2019, IEEE Transactions on Circuits and Systems for Video Technology.

[28]  Konrad Schindler,et al.  Learning by Tracking: Siamese CNN for Robust Target Association , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[29]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[30]  Ramakant Nevatia,et al.  Global data association for multi-object tracking using network flows , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Wei Wu,et al.  SiamRPN++: Evolution of Siamese Visual Tracking With Very Deep Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Silvio Savarese,et al.  Recurrent Autoregressive Networks for Online Multi-object Tracking , 2017, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[33]  Haibin Ling,et al.  FAMNet: Joint Learning of Feature, Affinity and Multi-Dimensional Assignment for Online Multiple Object Tracking , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[34]  Long Chen,et al.  Real-Time Multiple People Tracking with Deeply Learned Candidate Selection and Person Re-Identification , 2018, 2018 IEEE International Conference on Multimedia and Expo (ICME).

[35]  Robert T. Collins,et al.  Multi-target Tracking by Lagrangian Relaxation to Min-cost Network Flow , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  David A. McAllester,et al.  A discriminatively trained, multiscale, deformable part model , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  Silvio Savarese,et al.  Data-driven 3D Voxel Patterns for object category recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Nenghai Yu,et al.  Online Multi-object Tracking Using CNN-Based Single Object Tracker with Spatial-Temporal Attention Mechanism , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[39]  Konrad Schindler,et al.  Continuous Energy Minimization for Multitarget Tracking , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  Andreas Geiger,et al.  MOTS: Multi-Object Tracking and Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Euntai Kim,et al.  Multiple Object Tracking via Feature Pyramid Siamese Networks , 2019, IEEE Access.

[42]  Charless C. Fowlkes,et al.  Globally-optimal greedy algorithms for tracking a variable number of objects , 2011, CVPR 2011.

[43]  Thomas Brox,et al.  Motion Segmentation & Multiple Object Tracking by Correlation Co-Clustering , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44]  Radu Horaud,et al.  Tracking Multiple Persons Based on a Variational Bayesian Model , 2016, ECCV Workshops.

[45]  Fan Yang,et al.  Exploit All the Layers: Fast and Accurate CNN Object Detector with Scale Dependent Pooling and Cascaded Rejection Classifiers , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Bastian Leibe,et al.  Combined image- and world-space tracking in traffic scenes , 2017, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[47]  Radu Horaud,et al.  Exploiting the Complementarity of Audio and Visual Data in Multi-speaker Tracking , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[48]  Fan Yang,et al.  Trajectory Factory: Tracklet Cleaving and Re-Connection by Deep Siamese Bi-GRU for Multiple Object Tracking , 2018, 2018 IEEE International Conference on Multimedia and Expo (ICME).

[49]  Wei Wu,et al.  High Performance Visual Tracking with Siamese Region Proposal Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[50]  Yang Zhang,et al.  Enhancing Detection Model for Multiple Hypothesis Tracking , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[51]  Stefan Roth,et al.  MOTChallenge 2015: Towards a Benchmark for Multi-Target Tracking , 2015, ArXiv.

[52]  Wongun Choi,et al.  Deep Network Flow for Multi-object Tracking , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Pietro Perona,et al.  Fast Feature Pyramids for Object Detection , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[54]  Kaiming He,et al.  Focal Loss for Dense Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[55]  Radu Horaud,et al.  An on-line variational Bayesian model for multi-person tracking from cluttered scenes , 2016, Comput. Vis. Image Underst..

[56]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[57]  Bodo Rosenhahn,et al.  Fusion of Head and Full-Body Detectors for Multi-object Tracking , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[58]  Bernt Schiele,et al.  Multiple People Tracking by Lifted Multicut and Person Re-identification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Francesco Solera,et al.  Performance Measures and a Data Set for Multi-target, Multi-camera Tracking , 2016, ECCV Workshops.

[60]  S. Shankar Sastry,et al.  Markov Chain Monte Carlo Data Association for Multi-Target Tracking , 2009, IEEE Transactions on Automatic Control.

[61]  Stefan Roth,et al.  MOT16: A Benchmark for Multi-Object Tracking , 2016, ArXiv.

[62]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[63]  Silvio Savarese,et al.  Tracking the Untrackable: Learning to Track Multiple Cues with Long-Term Dependencies , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[64]  Laura Leal-Taixé,et al.  Tracking Without Bells and Whistles , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[65]  B. V. K. Vijaya Kumar,et al.  A multi-sensor fusion system for moving object detection and tracking in urban driving environments , 2014, 2014 IEEE International Conference on Robotics and Automation (ICRA).

[66]  Wei Wu,et al.  Distractor-aware Siamese Networks for Visual Object Tracking , 2018, ECCV.

[67]  James M. Rehg,et al.  Multi-object Tracking with Neural Gating Using Bilinear LSTM , 2018, ECCV.

[68]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[69]  Silvio Savarese,et al.  Learning to Track at 100 FPS with Deep Regression Networks , 2016, ECCV.

[70]  Radu Horaud,et al.  Variational Bayesian Inference for Audio-Visual Tracking of Multiple Speakers , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[71]  H. Kuhn The Hungarian method for the assignment problem , 1955 .