DIRV: Dense Interaction Region Voting for End-to-End Human-Object Interaction Detection

In recent years, human-object interaction (HOI) detection has advanced considerably. However, conventional two-stage methods are usually slow at inference. Existing one-stage methods, on the other hand, mainly focus on the union regions of interactions, which introduce unnecessary visual information that disturbs HOI detection. To tackle these problems, we propose DIRV, a novel one-stage HOI detection approach based on a new concept for the HOI problem called the interaction region. Unlike previous methods, our approach concentrates on densely sampled interaction regions across different scales for each human-object pair, so as to capture the subtle visual features that are most essential to the interaction. Moreover, to compensate for the detection flaws of any single interaction region, we introduce a novel voting strategy that makes full use of the overlapping interaction regions in place of conventional Non-Maximum Suppression (NMS). Extensive experiments on two popular benchmarks, V-COCO and HICO-DET, show that our approach outperforms existing state-of-the-art methods by a large margin, with the highest inference speed and the lightest network architecture. We achieve 56.1 mAP on V-COCO without additional input. Our code will be made publicly available.
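To illustrate the general idea of voting over overlapping regions instead of keeping a single winner, the following is a minimal sketch and not the voting formulation used in DIRV. It assumes each interaction region emits a dictionary of per-action scores for one human-object pair; the function name `vote_interactions`, the tuple layout, and the plain averaging rule are all hypothetical choices made for this example.

```python
from collections import defaultdict


def vote_interactions(predictions):
    """Average action scores over all regions that predict the same pair.

    predictions: iterable of (human_id, object_id, {action: score}) tuples,
    one tuple per interaction region (hypothetical layout for this sketch).
    Returns {(human_id, object_id): {action: mean_score}}.
    """
    score_sums = defaultdict(lambda: defaultdict(float))
    region_counts = defaultdict(int)

    for human_id, object_id, scores in predictions:
        pair = (human_id, object_id)
        region_counts[pair] += 1
        for action, score in scores.items():
            score_sums[pair][action] += score

    # An action missing from a region's scores counts as a zero vote,
    # because the sum is divided by the total number of regions for the pair.
    return {
        pair: {action: total / region_counts[pair]
               for action, total in actions.items()}
        for pair, actions in score_sums.items()
    }


# Two overlapping regions describe the same human-object pair (0, 1).
regions = [
    (0, 1, {"ride": 0.9, "hold": 0.2}),
    (0, 1, {"ride": 0.5, "hold": 0.6}),
]
voted = vote_interactions(regions)
```

An NMS-style rule would keep only the highest-scoring region and discard the rest, whereas this averaging lets every overlapping region contribute, so an error in one region can be partly offset by the others.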
