Self-supervised Multi-view Multi-Human Association and Tracking

Multi-view multi-human association and tracking (MvMHAT) aims to track a group of people over time in each view and, at the same time, to identify the same person across different views. This is a relatively new problem, but an important one for video surveillance of multi-person scenes. Unlike previous multiple object tracking (MOT) and multi-target multi-camera tracking (MTMCT) tasks, which only consider over-time human association, MvMHAT requires cross-view and over-time data association to be solved jointly. In this paper, we model this problem with a self-supervised learning framework and tackle it with an end-to-end network. Specifically, we propose a spatial-temporal association network trained with two self-supervised losses, a symmetric-similarity loss and a transitive-similarity loss, which are applied at each time step to associate multiple humans over time and across views. In addition, to promote research on MvMHAT, we build a new large-scale benchmark for training and testing different algorithms. Extensive experiments on the proposed benchmark verify the effectiveness of our method. We have released the benchmark and code to the public.
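
To make the two loss names more concrete, the following is a minimal sketch of one way symmetric and transitive consistency could be expressed over appearance features. It is an illustration under assumptions and not the released implementation: the function names, the softmax temperature, the squared-error distance to the identity matrix, and the use of plain Python lists are all hypothetical choices made here. Each set of features stands for the people detected in one view at one time step.

```python
import math

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m, temperature):
    """Turn each row of raw similarities into a probability distribution."""
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp((v - peak) / temperature) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def soft_assignment(feats_a, feats_b, temperature=0.1):
    """Entry (i, j): soft probability that person i in set A is person j in set B.
    The temperature value is an assumption, not taken from the paper."""
    return softmax_rows(matmul(feats_a, transpose(feats_b)), temperature)

def identity_gap(m):
    """Mean squared deviation of a square matrix from the identity."""
    n = len(m)
    return sum(
        (m[i][j] - (1.0 if i == j else 0.0)) ** 2
        for i in range(n) for j in range(n)
    ) / (n * n)

def symmetric_loss(feats_a, feats_b):
    """Assumed form: matching A -> B -> A should return each person to itself."""
    round_trip = matmul(soft_assignment(feats_a, feats_b),
                        soft_assignment(feats_b, feats_a))
    return identity_gap(round_trip)

def transitive_loss(feats_a, feats_b, feats_c):
    """Assumed form: matching A -> B -> C -> A should also close the loop."""
    cycle = matmul(matmul(soft_assignment(feats_a, feats_b),
                          soft_assignment(feats_b, feats_c)),
                   soft_assignment(feats_c, feats_a))
    return identity_gap(cycle)

# Toy features: two people seen in three views; view B lists them in swapped order.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.0, 1.0], [1.0, 0.0]]
view_c = [[1.0, 0.0], [0.0, 1.0]]
print(symmetric_loss(view_a, view_b))
print(transitive_loss(view_a, view_b, view_c))
```

Both losses need no identity labels, since the only supervision is that a chain of matches should end where it started. This sketch treats the identity matrix as the target, which only holds when every person in the first set is also visible in the others; handling people who are missing from a view would need an additional mechanism not shown here.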
