论文信息 - LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal Networks for HOI in videos

LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal Networks for HOI in videos

Analyzing the interactions between humans and objects from a video includes identification of the relationships between humans and the objects present in the video. It can be thought of as a specialized version of Visual Relationship Detection, wherein one of the objects must be a human. While traditional methods formulate the problem as inference on a sequence of video segments, we present a hierarchical approach, LIGHTEN, to learn visual features to effectively capture spatio-temporal cues at multiple granularities in a video. Unlike current approaches, LIGHTEN avoids using ground truth data like depth maps or 3D human pose, thus increasing generalization across non-RGBD datasets as well. Furthermore, we achieve the same using only the visual features, instead of the commonly used hand-crafted spatial features. We achieve state-of-the-art results in human-object interaction detection (88.9% and 92.6%) and anticipation tasks of CAD-120 and competitive results on image based HOI detection in V-COCO dataset, setting a new benchmark for visual features based approaches. Code for LIGHTEN is available at https://github.com/praneeth11009/LIGHTEN-Learning-Interactions-with-Graphs-and-Hierarchical-TEmporal-Networks-for-HOI

Ganesh Ramakrishnan | Rishabh Dabral | Sai Praneeth Reddy Sunkesula | Ganesh Ramakrishnan | Rishabh Dabral

[1] Fei-Fei Li,et al. Grouplet: A structured image representation for recognizing human and object interactions , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[2] Xuming He,et al. Pose-Aware Multi-Level Feature Network for Human Object Interaction Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[3] Ganesh Ramakrishnan,et al. Multi-Person 3D Human Pose Estimation from Monocular Images , 2019, 2019 International Conference on 3D Vision (3DV).

[4] Shifeng Zhang,et al. Occlusion-aware R-CNN: Detecting Pedestrians in a Crowd , 2018, ECCV.

[5] Ganesh Ramakrishnan,et al. Collective annotation of Wikipedia entities in web text , 2009, KDD.

[6] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7] Ganesh Ramakrishnan,et al. Robust Data Programming with Precision-guided Labeling Functions , 2020, AAAI.

[8] Jia Deng,et al. Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[9] Leonidas J. Guibas,et al. Human action recognition by learning bases of action attributes and parts , 2011, 2011 International Conference on Computer Vision.

[10] Cordelia Schmid,et al. Speech2Action: Cross-Modal Supervision for Action Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11] Ganesh Ramakrishnan,et al. An Interactive Multi-Label Consensus Labeling Model for Multiple Labeler Judgments , 2018, AAAI.

[12] Dinesh Manocha,et al. EVA: Generating Emotional Behavior of Virtual Agents using Expressive Features of Gait and Gaze , 2019, SAP.

[13] Michael Werman,et al. A Linear Time Histogram Metric for Improved SIFT Matching , 2008, ECCV.

[14] Yu Cao,et al. Annotating Objects and Relations in User-Generated Videos , 2019, ICMR.

[15] Ivan Laptev,et al. Learning person-object interactions for action recognition in still images , 2011, NIPS.

[16] Cewu Lu,et al. Transferable Interactiveness Prior for Human-Object Interaction Detection , 2018, ArXiv.

[17] Peter V. Gehler,et al. Neural Body Fitting: Unifying Deep Learning and Model Based Human Pose and Shape Estimation , 2018, 2018 International Conference on 3D Vision (3DV).

[18] Iasonas Kokkinos,et al. DensePose: Dense Human Pose Estimation in the Wild , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[19] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.

[20] Shiliang Pu,et al. Video Relation Detection with Spatio-Temporal Graph , 2019, ACM Multimedia.

[21] Gerard Pons-Moll,et al. 360-Degree Textures of People in Clothing from a Single Image , 2019, 2019 International Conference on 3D Vision (3DV).

[22] Pascal Fua,et al. XNect , 2019, ACM Trans. Graph..

[23] Jian-Huang Lai,et al. Recognising Human-Object Interaction via Exemplar Based Modelling , 2013, 2013 IEEE International Conference on Computer Vision.

[24] Jiaxuan Wang,et al. HICO: A Benchmark for Recognizing Human-Object Interactions in Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[25] Song-Chun Zhu,et al. Learning Human-Object Interactions by Graph Parsing Neural Networks , 2018, ECCV.

[26] Kaiming He,et al. Detecting and Recognizing Human-Object Interactions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[27] Hema Swetha Koppula,et al. Learning human activities and object affordances from RGB-D videos , 2012, Int. J. Robotics Res..

[28] Dinesh Manocha,et al. EmotiCon: Context-Aware Multimodal Emotion Recognition Using Frege’s Principle , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29] Chaitanya Patel,et al. TailorNet: Predicting Clothing in 3D as a Function of Human Pose, Shape and Garment Style , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31] Lei Shi,et al. Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32] Tat-Seng Chua,et al. Video Visual Relation Detection , 2017, ACM Multimedia.

[33] Jitendra Malik,et al. Visual Semantic Role Labeling , 2015, ArXiv.

[34] Gangshan Wu,et al. Video Visual Relation Detection via Multi-modal Feature Fusion , 2019, ACM Multimedia.

[35] Silvio Savarese,et al. Structural-RNN: Deep Learning on Spatio-Temporal Graphs , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36] Zhenghao Chen,et al. Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37] Pararth Shah,et al. Learning to Collectively Link Entities , 2016, CODS.

[38] Yuning Jiang,et al. Repulsion Loss: Detecting Pedestrians in a Crowd , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39] Bill Triggs,et al. Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[40] Hema Swetha Koppula,et al. Anticipating Human Activities Using Object Affordances for Reactive Robotic Response , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[42] Mohan S. Kankanhalli,et al. Interact as You Intend: Intention-Driven Human-Object Interaction Detection , 2018, IEEE Transactions on Multimedia.

[43] Abhishek Sharma,et al. Learning 3D Human Pose from Structure and Motion , 2017, ECCV.

[44] Hyunwoo Kim,et al. Mixed Effects Neural Networks (MeNets) With Applications to Gaze Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[45] Ali Farhadi,et al. Video Relationship Reasoning Using Gated Spatio-Temporal Energy Graph , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[46] Mohan S. Kankanhalli,et al. Learning to Detect Human-Object Interactions With Knowledge , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[47] Cordelia Schmid,et al. Learning from Synthetic Humans , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48] Yadong Mu,et al. Beyond Short-Term Snippet: Video Relation Detection With Spatio-Temporal Global Context , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[49] Nanning Zheng,et al. Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[50] Jitendra Malik,et al. End-to-End Recovery of Human Shape and Pose , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.