Cascaded Parsing of Human-Object Interaction Recognition

This paper addresses the task of detecting and recognizing human-object interactions (HOI) in images. Considering the intrinsic complexity and structural nature of the task, we introduce a cascaded parsing network (CP-HOI) for multi-stage, structured HOI understanding. At each cascade stage, an instance detection module progressively refines HOI proposals and feeds them into a structured interaction reasoning module. Each of the two modules is also connected to its predecessor in the previous stage, enabling efficient cross-stage information propagation. The structured interaction reasoning module is built upon a graph parsing neural network (GPNN), which efficiently models potential HOI structures as graphs and mines rich context for comprehensive relation understanding. In particular, GPNN infers a parse graph that i) interprets meaningful HOI structures through a learnable adjacency matrix, and ii) predicts action (edge) labels. Within an end-to-end, message-passing framework, GPNN blends learning and inference, iteratively parsing HOI structures and reasoning over HOI representations (i.e., instance and relation features). Going beyond relation detection at the bounding-box level, we make our framework flexible enough to perform fine-grained, pixel-wise relation segmentation; this offers a new perspective on relation modeling. A preliminary version of our CP-HOI model reached 1st place in the ICCV 2019 Person in Context Challenge, on both relation detection and segmentation. In addition, CP-HOI shows promising results on two popular HOI recognition benchmarks, i.e., V-COCO and HICO-DET.
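
To make the iterative parsing idea concrete, the following is a minimal NumPy sketch of message passing over a soft, learnable adjacency matrix with an edge-level action readout. It is an illustration under stated assumptions, not the CP-HOI or GPNN implementation: the function `parse_graph`, the weight names `w_link`, `w_msg`, `w_upd`, `w_act`, the sigmoid link function, and the `tanh` node update are all hypothetical stand-ins, since the abstract does not specify these components.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def pairwise(h):
    """Build pair features [h_i, h_j] for every ordered node pair: (n, n, 2d)."""
    n = h.shape[0]
    h_i = np.repeat(h[:, None, :], n, axis=1)  # row index i
    h_j = np.repeat(h[None, :, :], n, axis=0)  # column index j
    return np.concatenate([h_i, h_j], axis=-1)


def soft_adjacency(h, w_link):
    """Soft adjacency in [0, 1] from pair features; self-links are zeroed."""
    adj = sigmoid(pairwise(h) @ w_link)  # (n, n)
    np.fill_diagonal(adj, 0.0)
    return adj


def parse_graph(nodes, w_link, w_msg, w_upd, w_act, steps=3):
    """Hypothetical sketch of iterative graph parsing.

    nodes:  (n, d) instance features (humans and objects).
    w_link: (2d,)   weights of the assumed link function.
    w_msg:  (d, d)  weights of the assumed message function.
    w_upd:  (2d, d) weights of the assumed node update (tanh stand-in).
    w_act:  (2d, k) weights of the assumed action (edge label) readout.
    Returns the soft adjacency (n, n) and per-edge action scores (n, n, k).
    """
    h = nodes.copy()
    for _ in range(steps):
        # 1) Parse structure: re-estimate the adjacency from current features.
        adj = soft_adjacency(h, w_link)
        # 2) Message passing: node i receives sum_j adj[i, j] * (h_j @ w_msg).
        msg = adj @ (h @ w_msg)  # (n, d)
        # 3) Update node representations from their state and incoming message.
        h = np.tanh(np.concatenate([h, msg], axis=-1) @ w_upd)
    # Final parse graph: adjacency plus multi-label action scores per edge.
    adj = soft_adjacency(h, w_link)
    actions = sigmoid(pairwise(h) @ w_act)  # (n, n, k)
    return adj, actions


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, k = 4, 8, 5  # 4 instances, 8-dim features, 5 action classes
    adj, actions = parse_graph(
        rng.normal(size=(n, d)),
        rng.normal(size=2 * d),
        rng.normal(size=(d, d)),
        rng.normal(size=(2 * d, d)),
        rng.normal(size=(2 * d, k)),
    )
    print(adj.shape, actions.shape)
```

In this sketch the adjacency is recomputed at every step, so structure parsing and feature reasoning alternate. A final HOI score for a pair could, for instance, combine `adj[i, j]` with `actions[i, j]`, though that scoring rule is likewise an assumption.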
