Human-Object Interaction Detection via Weak Supervision

The goal of this paper is Human-Object Interaction (HO-I) detection. HO-I detection aims to localize interacting human-object regions in an image and classify their interactions. Researchers have obtained significant improvements in recent years by relying on the strong HO-I alignment supervision of [5]. HO-I alignment supervision pairs each human with the objects they interact with, and then aligns each human-object pair with its interaction category. Since collecting such annotation is expensive, in this paper we propose to detect HO-I without alignment supervision. We instead rely on image-level supervision, which only enumerates the interactions present in the image without indicating where they occur. Our paper makes three contributions: i) We propose Align-Former, a visual-transformer-based CNN that can detect HO-I with only image-level supervision. ii) Align-Former is equipped with an HO-I align layer that learns to select appropriate targets to allow detector supervision. iii) We evaluate Align-Former on HICO-DET [5] and V-COCO [13], and show that Align-Former outperforms existing image-level supervised HO-I detectors by a large margin (a 4.71-point mAP improvement, from 16.14% to 20.85%, on HICO-DET [5]).
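To make the difference between the two supervision regimes concrete, the following is a minimal sketch of how the labels for one image might look under each. The class name, field names, box coordinates, and interaction strings are all hypothetical and chosen for illustration; they do not reproduce the HICO-DET or V-COCO annotation formats, nor any part of Align-Former itself.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class AlignedHOI:
    """Alignment supervision: a human box paired with an object box
    and the interaction category that links them."""
    human_box: Box
    object_box: Box
    interaction: str


def to_image_level(aligned: List[AlignedHOI]) -> Set[str]:
    """Discard all boxes and pairings, keeping only the set of
    interaction categories that occur somewhere in the image."""
    return {a.interaction for a in aligned}


# Illustrative (made-up) alignment annotations for a single image.
aligned_labels = [
    AlignedHOI((10, 20, 110, 300), (90, 150, 260, 310), "ride bicycle"),
    AlignedHOI((300, 40, 380, 290), (320, 200, 360, 240), "hold cup"),
    AlignedHOI((400, 30, 470, 280), (390, 180, 520, 300), "ride bicycle"),
]

# Image-level supervision for the same image: no boxes, no pairing.
image_level_labels = to_image_level(aligned_labels)
print(sorted(image_level_labels))
```

Under image-level supervision the detector only sees the resulting set of categories, so which human-object pair carries which interaction, and how many such pairs exist, has to be inferred during training.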

[1]  Chen Gao, et al.  iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection, 2018, BMVC.

[2]  Jia Deng, et al.  Learning to Detect Human-Object Interactions, 2017, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[3]  Kaiming He, et al.  Detecting and Recognizing Human-Object Interactions, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[4]  Mert Kilickaya, et al.  Structured Visual Search via Composition-aware Learning, 2021, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[5]  Fei Wang, et al.  PPDM: Parallel Point Detection and Matching for Real-Time Human-Object Interaction Detection, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Serge J. Belongie, et al.  Convolutional Networks with Adaptive Inference Graphs, 2017, International Journal of Computer Vision.

[7]  Luca Antiga, et al.  Automatic differentiation in PyTorch, 2017.

[8]  Chen Gao, et al.  DRG: Dual Relation Graph for Human-Object Interaction Detection, 2020, ECCV.

[9]  Nicolas Usunier, et al.  End-to-End Object Detection with Transformers, 2020, ECCV.

[10]  Jian Sun, et al.  Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Shih-Fu Chang, et al.  PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[12]  Yi Li, et al.  R-FCN: Object Detection via Region-based Fully Convolutional Networks, 2016, NIPS.

[13]  Max Welling, et al.  Batch-shaping for learning conditional channel gated networks, 2019, ICLR.

[14]  Pietro Perona, et al.  Microsoft COCO: Common Objects in Context, 2014, ECCV.

[15]  Andrew Zisserman, et al.  Amplifying Key Cues for Human-Object-Interaction Detection, 2020, ECCV.

[16]  Si Liu, et al.  Reformulating HOI Detection as Adaptive Set Prediction, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Derek Hoiem, et al.  No-Frills Human-Object Interaction Detection: Factorization, Appearance and Layout Encodings, and Training Techniques, 2018, ArXiv.

[18]  Kaiming He, et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Jitendra Malik, et al.  Visual Semantic Role Labeling, 2015, ArXiv.

[20]  Y. Qiao, et al.  Visual Compositional Learning for Human-Object Interaction Detection, 2020, ECCV.

[21]  Yang Li, et al.  You Look Twice: GaterNet for Dynamic Filter Selection in CNNs, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Yee Whye Teh, et al.  The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables, 2016, ICLR.

[23]  Mert Kilickaya, et al.  Self-Selective Context for Interaction Recognition, 2020, 2020 25th International Conference on Pattern Recognition (ICPR).

[24]  In So Kweon, et al.  Detecting Human-Object Interactions with Action Co-occurrence Priors, 2020, ECCV.

[25]  Lukasz Kaiser, et al.  Attention is All you Need, 2017, NIPS.

[26]  Miaojing Shi, et al.  Detecting Human-Object Interaction with Mixed Supervision, 2020, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[27]  Ross B. Girshick, et al.  Fast R-CNN, 2015, arXiv:1504.08083.

[28]  Tomoaki Yoshinaga, et al.  QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).