Towards Causality Inference for Very Important Person Localization

Very Important Person Localization (VIPLoc) aims to detect the individuals in a given image who are more attractive than the others. The existing uncontrolled VIPLoc benchmark assumes that each image contains exactly one VIP, which does not suit real application scenarios where an image may contain multiple VIPs or none. In this paper, we rebuild a complex uncontrolled conditions (CUC) dataset containing images with no, single, and multiple VIPs, bringing VIPLoc closer to real situations. Existing methods use hand-designed or deep learning strategies to extract person features and analyze the differences between VIPs and other persons from a statistical perspective. They cannot explain why a given input leads to a particular VIP output, so their performance degrades severely when they are applied to real-world VIPLoc. To address this, we establish a causal inference framework that unpacks the causes of the limitations of previous methods and derives a new principled solution for VIPLoc. It treats the scene as a confounding factor, allowing the ever-elusive confounding effects to be eliminated and the essential determinants to be uncovered. In extensive experiments, our method outperforms the state-of-the-art methods on public VIPLoc datasets and the rebuilt CUC dataset.
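The abstract does not say how the scene confounder is removed. One standard way to do this in causal inference is backdoor adjustment, which replaces the observational quantity P(Y | X) with the interventional P(Y | do(X)) = sum over s of P(Y | X, s) P(s). The sketch below is only a toy illustration of that idea under this assumption, not the paper's method: the scene labels, the binary person feature X, and every probability are made-up values.

```python
# Toy backdoor adjustment with a discrete scene confounder S.
# Assumption: all names and numbers here are hypothetical illustrations.

# P(S = s): prior over scene types
p_scene = {"stage": 0.5, "street": 0.5}
# P(X = 1 | S = s): chance that a person shows some cue (e.g. is centered)
p_x_given_scene = {"stage": 0.8, "street": 0.2}
# P(Y = 1 | X = 1, S = s): chance that such a person is labeled a VIP
p_vip_given_x_scene = {"stage": 0.9, "street": 0.3}


def observational(p_s, p_x_s, p_y_xs):
    """P(Y=1 | X=1): scenes are weighted by P(s | X=1), so the scene bias leaks in."""
    p_x = sum(p_s[s] * p_x_s[s] for s in p_s)
    return sum(p_y_xs[s] * (p_s[s] * p_x_s[s] / p_x) for s in p_s)


def interventional(p_s, p_y_xs):
    """P(Y=1 | do(X=1)): scenes are weighted by the prior P(s) (backdoor adjustment)."""
    return sum(p_y_xs[s] * p_s[s] for s in p_s)


p_obs = observational(p_scene, p_x_given_scene, p_vip_given_x_scene)
p_do = interventional(p_scene, p_vip_given_x_scene)
print(f"P(Y=1 | X=1)     = {p_obs:.2f}")
print(f"P(Y=1 | do(X=1)) = {p_do:.2f}")
```

In this toy setting the observational estimate is inflated because the cue X co-occurs with the "stage" scene, where VIPs are more common; averaging over the scene prior removes that dependence.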
