Deep Occlusion Reasoning for Multi-camera Multi-target Detection

People detection in single 2D images has improved greatly in recent years. However, comparatively little of this progress has percolated into multi-camera multi-people tracking algorithms, whose performance still degrades severely when scenes become very crowded. In this work, we introduce a new architecture that combines Convolutional Neural Nets and Conditional Random Fields to explicitly model the ambiguities that arise in such crowded scenes. One of its key ingredients is a set of high-order CRF terms that model potential occlusions and make our approach robust even when many people are present. Our model is trained end-to-end, and we show that it outperforms several state-of-the-art algorithms on challenging scenes.
