Transductive Multi-Object Tracking in Complex Events by Interactive Self-Training

Multi-object tracking (MOT), which estimates the trajectories of pedestrians, has developed rapidly in recent years and plays an important role in human-centric video analysis. However, video analysis in complex events (e.g. the scenes in the HiEve dataset) is still under-explored. In complex real-world scenarios, existing online MOT methods without domain adaptation face two challenges: the domain gap between training data and unseen testing scenes, and severe occlusion that breaks tracks into disconnected fragments. To alleviate the domain gap, we study the problem in a transductive learning setting, which assumes that unlabeled testing data is available for learning an offline tracker. We propose a transductive interactive self-training method that adapts the tracking model to unseen crowded scenes using unlabeled testing data by means of teacher-student interactive learning. To reduce prediction variance in the unseen domain, we train two different models and interactively teach each model with pseudo labels that the other model predicts on the unlabeled data. To improve robustness against occlusions during self-training, we use disconnected track interpolation (DTI) to refine the predicted pseudo labels. Our method achieved a MOTA of 60.23 on the HiEve dataset and won first place in the Multi-person Motion Tracking in Complex Events (with Private Detection) track of the ACM MM Grand Challenge on Large-scale Human-centric Video Analysis in Complex Events.
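The abstract does not spell out how DTI refines pseudo labels, so the following is only a minimal sketch of one plausible reading: frames missing from a track (for example, during an occlusion) are filled by linearly interpolating the bounding boxes on either side of the gap. The function name `interpolate_track`, the `(x, y, w, h)` box layout, and the `max_gap` limit are all assumptions made for illustration, not details taken from the paper.

```python
from typing import Dict, Tuple

# Hypothetical box layout: (x, y, w, h). The abstract does not specify one.
Box = Tuple[float, float, float, float]


def interpolate_track(track: Dict[int, Box], max_gap: int = 20) -> Dict[int, Box]:
    """Fill missing frames of a single track by linear interpolation.

    `track` maps frame index -> box. Gaps longer than `max_gap` frames
    (an assumed threshold) are left unfilled, since a long gap is more
    likely to be a genuine exit than a short occlusion.
    """
    frames = sorted(track)
    filled = dict(track)
    for prev, nxt in zip(frames, frames[1:]):
        gap = nxt - prev
        if gap <= 1 or gap > max_gap:
            continue  # consecutive frames, or a gap too long to trust
        for f in range(prev + 1, nxt):
            t = (f - prev) / gap  # fraction of the way from prev to nxt
            filled[f] = tuple(
                a + t * (b - a) for a, b in zip(track[prev], track[nxt])
            )
    return filled


# Usage: a track observed at frames 1 and 5, with frames 2-4 missing.
track = {1: (0.0, 0.0, 10.0, 20.0), 5: (40.0, 8.0, 10.0, 20.0)}
filled = interpolate_track(track)
print(filled[3])  # (20.0, 4.0, 10.0, 20.0)
```

In a self-training loop of the kind described above, a step like this would be applied to each predicted track before the boxes are handed to the other model as pseudo labels, so that occluded frames still contribute training targets.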
