Complementary-View Co-Interest Person Detection

Fast and accurate identification of co-interest persons, i.e., those who draw the joint interest of surrounding people, plays an important role in social scene understanding and surveillance. Previous studies mainly focus on detecting co-interest persons from a single-view video. In this paper, we study a much more realistic and challenging problem, namely co-interest person (CIP) detection from multiple temporally synchronized videos captured from complementary and time-varying views. Specifically, we use a top-view camera, mounted on a drone flying at high altitude, to obtain a global view of the whole scene and all subjects on the ground, and multiple horizontal-view cameras, worn by selected subjects, to obtain local views of nearby persons and environment details. We present an efficient top- and horizontal-view data fusion strategy that maps the multiple horizontal views into the global top view. We then propose a spatial-temporal CIP potential energy function that jointly considers intra-frame confidence and inter-frame consistency, leading to an effective Conditional Random Field (CRF) formulation. We also construct a complementary-view video dataset, which provides a benchmark for the study of multi-view co-interest person detection. Extensive experiments validate the effectiveness and superiority of the proposed method.
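To make the idea of combining intra-frame confidence with inter-frame consistency concrete, the following is a minimal sketch of decoding a chain-structured energy with dynamic programming (Viterbi-style). It is not the energy function defined in the paper, whose exact terms are not given here. The function name `decode_cip`, the per-frame confidence scores, and the constant `switch_cost` penalty for changing the selected person between consecutive frames are all assumptions made for illustration.

```python
def decode_cip(confidence, switch_cost=1.0):
    """Pick one candidate person per frame by minimizing a chain energy.

    confidence[t][j]: hypothetical intra-frame score that person j is the
    CIP in frame t (higher means more likely).
    switch_cost: hypothetical inter-frame penalty paid whenever the selected
    person differs between consecutive frames.

    Energy: sum_t -confidence[t][x_t] + switch_cost * [x_t != x_{t-1}]
    """
    num_frames = len(confidence)
    num_people = len(confidence[0])

    # Best energy of any labeling ending at each person in the first frame.
    cost = [-c for c in confidence[0]]
    back = []  # back[t-1][j]: best predecessor of person j at frame t

    for t in range(1, num_frames):
        new_cost, pointers = [], []
        for j in range(num_people):
            # Cheapest predecessor i for selecting person j at frame t.
            best_i = min(
                range(num_people),
                key=lambda i: cost[i] + (0.0 if i == j else switch_cost),
            )
            transition = 0.0 if best_i == j else switch_cost
            new_cost.append(cost[best_i] + transition - confidence[t][j])
            pointers.append(best_i)
        cost = new_cost
        back.append(pointers)

    # Backtrack from the lowest-energy final state.
    path = [min(range(num_people), key=lambda j: cost[j])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy scores for 3 frames and 2 people; frame 1 weakly favors person 1.
scores = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]]
smoothed = decode_cip(scores, switch_cost=1.0)
per_frame = decode_cip(scores, switch_cost=0.0)
```

In this toy input, a positive `switch_cost` keeps the selection on person 0 throughout, while setting it to zero reduces the decoding to an independent per-frame choice, which is the trade-off the inter-frame consistency term is meant to control.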
