Category Query Learning for Human-Object Interaction Classification

Unlike most previous HOI methods, which focus on learning better human-object features, we propose a novel and complementary approach called category query learning. The queries are explicitly associated with interaction categories, converted to image-specific category representations via a transformer decoder, and learnt via an auxiliary image-level classification task. The idea is motivated by an earlier multi-label image classification method, but is applied here for the first time to the challenging task of human-object interaction classification. Our method is simple, general and effective. It is validated on three representative HOI baselines and achieves new state-of-the-art results on two benchmarks.
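The pipeline described above (one query per interaction category, decoding against image features, an image-level multi-label loss) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single single-head cross-attention step without learned projections in place of a full transformer decoder, a per-category linear classifier, and a binary cross-entropy loss. All function names, sizes, and the random inputs are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def decode_category_queries(queries, features):
    """One cross-attention step: each category query attends over all image
    feature tokens and becomes an image-specific category representation.
    (Assumption: a real decoder stacks several layers with learned projections.)"""
    dim = len(queries[0])
    reprs = []
    for q in queries:
        weights = softmax([dot(q, f) / math.sqrt(dim) for f in features])
        reprs.append([sum(w * f[i] for w, f in zip(weights, features))
                      for i in range(dim)])
    return reprs


def image_level_logits(reprs, class_weights, biases):
    """One logit per category: is this interaction present anywhere in the image?"""
    return [dot(r, w) + b for r, w, b in zip(reprs, class_weights, biases)]


def bce_with_logits(logits, labels):
    """Mean binary cross-entropy over categories (assumed auxiliary loss)."""
    total = 0.0
    for z, y in zip(logits, labels):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
num_categories, num_tokens, dim = 4, 6, 8      # hypothetical sizes
queries = rand_matrix(num_categories, dim)     # learnable in practice
features = rand_matrix(num_tokens, dim)        # stand-in for backbone features
class_weights = rand_matrix(num_categories, dim)
biases = [0.0] * num_categories
labels = [1.0, 0.0, 0.0, 1.0]                  # image-level multi-label target

reprs = decode_category_queries(queries, features)
logits = image_level_logits(reprs, class_weights, biases)
loss = bce_with_logits(logits, labels)
```

In a trained model, the queries and classifier weights would be updated by back-propagating this auxiliary loss together with the main HOI loss, and the resulting category representations would then be available to the HOI classifier of the chosen baseline.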
