A Stochastic Attribute Grammar for Robust Cross-View Human Tracking

In computer vision, tracking humans across camera views remain challenging, especially for complex scenarios with frequent occlusions, significant lighting changes, and other difficulties. Under such conditions, most existing appearance and geometric cues are not reliable enough to distinguish humans across camera views. To address these challenges, this paper presents a stochastic attribute grammar model for leveraging complementary and discriminative human attributes for enhancing cross-view tracking. The key idea of our method is to introduce a hierarchical representation, parse graph, to describe a subject and its movement trajectory in both space and time domains. These results in a hierarchical compositional representation, comprising trajectory entities of varying level, including human boxes, 3D human boxes, tracklets, and trajectories. We use a set of grammar rules to decompose a graph node (e.g., tracklet) into a set of children nodes (e.g., 3D human boxes), and augment each node with a set of attributes, including geometry (e.g., moving speed and direction), accessories (e.g., bags), and/or activities (e.g., walking and running). These attributes serve as valuable cues, in addition to appearance features (e.g., colors), in determining the associations of human detection boxes across cameras. In particular, the attributes of a parent node are inherited by its children nodes, resulting in consistency constraints over the feasible parse graph. Thus, we cast cross-view human tracking as finding the most discriminative parse graph for each subject in videos. We develop a learning method to train this attribute grammar model from weakly supervised training data. To infer the optimal parse graph and its attributes, we develop an alternative parsing method that employs both top-down and bottom-up computations to search the optimal solution. We also explicitly reason the occlusion status of each entity in order to deal with significant changes of camera viewpoints. We evaluate the proposed method over public video benchmarks, and demonstrate with extensive experiments that our method clearly outperforms the state-of-the-art tracking methods.

[1]  Charless C. Fowlkes,et al.  Globally-optimal greedy algorithms for tracking a variable number of objects , 2011, CVPR 2011.

[2]  Hai Jin,et al.  Adaptive Object Tracking by Learning Hybrid Template Online , 2011, IEEE Transactions on Circuits and Systems for Video Technology.

[3]  Binlong Li,et al.  Dynamic subspace-based coordinated multicamera tracking , 2011, 2011 International Conference on Computer Vision.

[4]  Liang Lin,et al.  Contextualized trajectory parsing with spatiotemporal graph. , 2013, IEEE transactions on pattern analysis and machine intelligence.

[5]  Gang Wang,et al.  Tracklet Association with Online Target-Specific Metric Learning , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Liang Lin,et al.  Human Re-identification by Matching Compositional Template with Cluster Sampling , 2013, 2013 IEEE International Conference on Computer Vision.

[7]  Xiaobai Liu,et al.  Multi-View 3D Human Tracking in Crowded Scenes , 2016, AAAI.

[8]  Chang‐Jin Kim,et al.  State Space Models with Regime Switching , 1999 .

[9]  Ramakant Nevatia,et al.  An online learned CRF model for multi-target tracking , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Konrad Schindler,et al.  Detection- and Trajectory-Level Exclusion in Multiple Object Tracking , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Rynson W. H. Lau,et al.  Visual Tracking via Locality Sensitive Histograms , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Bohyung Han,et al.  Learning Multi-domain Convolutional Neural Networks for Visual Tracking , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Mubarak Shah,et al.  A Multiview Approach to Tracking People in Crowded Scenes Using a Planar Homography Constraint , 2006, ECCV.

[14]  Nanning Zheng,et al.  Stereo Matching Using Belief Propagation , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[15]  Thomas Mauthner,et al.  Occlusion Geodesics for Online Multi-object Tracking , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Ying Wu,et al.  What Are We Tracking: A Unified Approach of Tracking and Recognition , 2013, IEEE Transactions on Image Processing.

[17]  Andrew Zisserman,et al.  Feature Based Methods for Structure and Motion Estimation , 1999, Workshop on Vision Algorithms.

[18]  Song-Chun Zhu,et al.  Joint action recognition and pose estimation from video , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Song-Chun Zhu,et al.  Joint inference of groups, events and human roles in aerial videos , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Ramakant Nevatia,et al.  Global data association for multi-object tracking using network flows , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Song-Chun Zhu,et al.  Single-View 3D Scene Parsing by Attributed Grammar , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Stan Z. Li,et al.  A Probabilistic Framework for Multitarget Tracking with Mutual Occlusions , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Ramakant Nevatia,et al.  How does person identity recognition help multi-person tracking? , 2011, CVPR 2011.

[24]  Gérard G. Medioni,et al.  Multiple Target Tracking Using Spatio-Temporal Markov Chain Monte Carlo Data Association , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Song-Chun Zhu,et al.  Attributed Grammars for Joint Estimation of Human Attributes, Part and Pose , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[27]  Adrian Barbu,et al.  Generalizing Swendsen-Wang to sampling arbitrary posterior probabilities , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Zhuowen Tu,et al.  Image Segmentation by Data-Driven Markov Chain Monte Carlo , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[29]  Rui Caseiro,et al.  Globally optimal solution to multi-object tracking with merged measurements , 2011, 2011 International Conference on Computer Vision.

[30]  Yi Wu,et al.  Online Object Tracking: A Benchmark , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Ming Yang,et al.  Spatial selection for attentional visual tracking , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[33]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[34]  Margrit Betke,et al.  Tracking a large number of objects from multiple views , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[35]  Nanning Zheng,et al.  Modeling 4D Human-Object Interactions for Joint Event Segmentation, Recognition, and Object Localization , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  D. Kimura Dual functional asymmetry of the brain in visual perception , 1966 .

[37]  Song-Chun Zhu,et al.  Human Attribute Recognition by Rich Appearance Dictionary , 2013, 2013 IEEE International Conference on Computer Vision.

[38]  Yang Liu,et al.  Multi-view People Tracking via Hierarchical Trajectory Composition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Pascal Fua,et al.  Multicamera People Tracking with a Probabilistic Occupancy Map , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  Luc Van Gool,et al.  Does Human Action Recognition Benefit from Pose Estimation? , 2011, BMVC.

[41]  Song-Chun Zhu,et al.  Cross-View People Tracking by Scene-Centered Spatio-Temporal Parsing , 2017, AAAI.

[42]  Konrad Schindler,et al.  Multi-target tracking by continuous energy minimization , 2011, CVPR 2011.

[43]  Song-Chun Zhu,et al.  C^4: Exploring Multiple Solutions in Graphical Models by Cluster Sampling , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44]  Li-Chen Fu,et al.  On-line human action recognition by combining joint tracking and key pose recognition , 2012, 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[45]  Pascal Fua,et al.  Multi-Commodity Network Flow for Tracking Multiple People , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[46]  James J. Little,et al.  A Linear Programming Approach for Multiple Object Tracking , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  David A. McAllester,et al.  Object Detection with Grammar Models , 2011, NIPS.

[48]  Deva Ramanan,et al.  Efficiently Scaling up Crowdsourced Video Annotation , 2012, International Journal of Computer Vision.

[49]  Dit-Yan Yeung,et al.  Learning a Deep Compact Image Representation for Visual Tracking , 2013, NIPS.

[50]  Ákos Utasi,et al.  A 3-D marked point process model for multi-view people detection , 2011, CVPR 2011.

[51]  Pascal Fua,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence 1 Multiple Object Tracking Using K-shortest Paths Optimization , 2022 .