Bringing Generalization to Deep Multi-view Detection

Multi-view Detection (MVD) is highly effective for occlusion reasoning in a crowded environment. While recent works using deep learning have made significant advances in the field, they have overlooked the generalization aspect, which makes them impractical for real-world deployment. The key novelty of our work is to formalize three critical forms of generalization and propose experiments to evaluate them: generalization with i) a varying number of cameras, ii) varying camera positions, and finally, iii) to new scenes. We find that existing state-of-the-art models show poor generalization by overfitting to a single scene and camera configuration. To address the concerns: (a) we propose a novel Generalized MVD (GMVD) dataset, assimilating diverse scenes with changing daytime, camera configurations, varying number of cameras, and (b) we discuss the properties essential to bring generalization to MVD and propose a barebones model to incorporate them. We perform a comprehensive set of experiments on the WildTrack, MultiViewX and the GMVD datasets to motivate the necessity to evaluate generalization abilities of MVD methods and to demonstrate the efficacy of the proposed approach. The code and the proposed dataset can be found at https://github.com/jeetv/GMVD

[1]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[2]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[4]  Xiaogang Wang,et al.  Partial Occlusion Handling in Pedestrian Detection With a Deep Model , 2016, IEEE Transactions on Circuits and Systems for Video Technology.

[5]  Vineet Gandhi,et al.  ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction , 2020, 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[6]  David A. McAllester,et al.  A discriminatively trained, multiscale, deformable part model , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Junsong Yuan,et al.  Stacked Homography Transformations for Multi-View Pedestrian Detection , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[8]  Stephen Gould,et al.  Multiview Detection with Feature Perspective Transformation , 2020, ECCV.

[9]  Yuning Jiang,et al.  Repulsion Loss: Detecting Pedestrians in a Crowd , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[10]  Bin Li,et al.  Deformable DETR: Deformable Transformers for End-to-End Object Detection , 2020, ICLR.

[11]  Gunhee Kim,et al.  Improving Occlusion and Hard Negative Handling for Single-Stage Pedestrian Detectors , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[12]  Yuning Jiang,et al.  FoveaBox: Beyound Anchor-Based Object Detection , 2019, IEEE Transactions on Image Processing.

[13]  Paolo Favaro,et al.  Self-Supervised Feature Learning by Learning to Spot Artifacts , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[14]  Sven Nordholm,et al.  A Bayesian Filter for Multi-View 3D Multi-Object Tracking With Occlusion Handling , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Pascal Fua,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence 1 Multiple Object Tracking Using K-shortest Paths Optimization , 2022 .

[17]  Shifeng Zhang,et al.  Occlusion-aware R-CNN: Detecting Pedestrians in a Crowd , 2018, ECCV.

[18]  Vineet Gandhi,et al.  Tidying Deep Saliency Prediction Architectures , 2020, 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[19]  Pascal Fua,et al.  Deep Occlusion Reasoning for Multi-camera Multi-target Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[20]  Frédo Durand,et al.  What Do Different Evaluation Metrics Tell Us About Saliency Models? , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Pascal Fua,et al.  Conditional Random Fields for multi-camera object detection , 2011, 2011 International Conference on Computer Vision.

[22]  Jesús Bescós,et al.  Semantic Driven Multi-Camera Pedestrian Detection , 2018, ArXiv.

[23]  Jing Zhang,et al.  Framework for Performance Evaluation of Face, Text, and Vehicle Detection and Tracking in Video: Data, Metrics, and Protocol , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  L. Gool,et al.  mDALU: Multi-Source Domain Adaptation and Label Unification with Partial Datasets , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[25]  Veronica Teichrieb,et al.  Generalizable Multi-Camera 3D Pedestrian Detection , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[26]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[27]  Pascal Fua,et al.  Tracking multiple people under global appearance constraints , 2011, 2011 International Conference on Computer Vision.

[28]  Chunluan Zhou,et al.  Multi-label Learning of Part Detectors for Heavily Occluded Pedestrian Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[29]  Nicu Sebe,et al.  Learning to Generalize Unseen Domains via Memory-based Multi-Source Meta-Learning for Person Re-Identification , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Ji Wan,et al.  Multi-view 3D Object Detection Network for Autonomous Driving , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Liang Zheng,et al.  Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation) , 2021, ACM Multimedia.

[32]  Yannick Boursier,et al.  Sparsity Driven People Localization with a Heterogeneous Network of Cameras , 2011, Journal of Mathematical Imaging and Vision.

[33]  Yuning Jiang,et al.  FoveaBox: Beyond Anchor-based Object Detector , 2019, ArXiv.

[34]  Luc Van Gool,et al.  WILDTRACK: A Multi-camera HD Dataset for Dense Unscripted Pedestrian Detection , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[35]  Zihang Lai,et al.  Self-supervised Learning for Video Correspondence Flow , 2019, ArXiv.

[36]  Yang Liu,et al.  Multi-view People Tracking via Hierarchical Trajectory Composition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Hei Law,et al.  CornerNet: Detecting Objects as Paired Keypoints , 2018, ECCV.

[38]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Qi Tian,et al.  CenterNet: Keypoint Triplets for Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[40]  Tatjana Chavdarova,et al.  Deep Multi-camera People Detection , 2017, 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA).

[41]  Pascal Fua,et al.  Multicamera People Tracking with a Probabilistic Occupancy Map , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[42]  Jitendra Malik,et al.  Learning Rich Features from RGB-D Images for Object Detection and Segmentation , 2014, ECCV.