End-to-End Human Object Interaction Detection with HOI Transformer

We propose HOI Transformer to tackle human object interaction (HOI) detection in an end-to-end manner. Current approaches either decouple HOI task into separated stages of object detection and interaction classification or introduce surrogate interaction problem. In contrast, our method, named HOI Transformer, streamlines the HOI pipeline by eliminating the need for many hand-designed components. HOI Transformer reasons about the relations of objects and humans from global image context and directly predicts HOI instances in parallel. A quintuple matching loss is introduced to force HOI predictions in a unified way. Our method is conceptually much simpler and demonstrates improved accuracy. Without bells and whistles, HOI Transformer achieves 26.61% AP on HICO-DET and 52.9% AProle on V-COCO, surpassing previous methods with the advantage of being much simpler. We hope our approach will serve as a simple and effective alternative for HOI tasks. Code is available at https://github.com/bbepoch/HoiTransformer.

[1]  Wei-Shi Zheng,et al.  Contextual Heterogeneous Graph Network for Human-Object Interaction Detection , 2020, ECCV.

[2]  Mohan S. Kankanhalli,et al.  Learning to Detect Human-Object Interactions With Knowledge , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Jitendra Malik,et al.  Visual Semantic Role Labeling , 2015, ArXiv.

[4]  Ross B. Girshick,et al.  Focal Loss for Dense Object Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Y. Qiao,et al.  Visual Compositional Learning for Human-Object Interaction Detection , 2020, ECCV.

[6]  Nicolas Usunier,et al.  End-to-End Object Detection with Transformers , 2020, ECCV.

[7]  Frank Hutter,et al.  Decoupled Weight Decay Regularization , 2017, ICLR.

[8]  Li Fei-Fei,et al.  Scaling Human-Object Interaction Recognition Through Zero-Shot Learning , 2018, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[9]  Rama Chellappa,et al.  Detecting Human-Object Interactions via Functional Generalization , 2019, AAAI.

[10]  Song-Chun Zhu,et al.  Learning Human-Object Interactions by Graph Parsing Neural Networks , 2018, ECCV.

[11]  Xingyi Zhou,et al.  Objects as Points , 2019, ArXiv.

[12]  Cewu Lu,et al.  Transferable Interactiveness Knowledge for Human-Object Interaction Detection , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Mingmin Chi,et al.  Relation Parsing Neural Network for Human-Object Interaction Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[14]  Kaiming He,et al.  Focal Loss for Dense Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[15]  Cewu Lu,et al.  PaStaNet: Toward Human Activity Knowledge Engine , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Silvio Savarese,et al.  Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Victor O. K. Li,et al.  Non-Autoregressive Neural Machine Translation , 2017, ICLR.

[18]  Chen Gao,et al.  DRG: Dual Relation Graph for Human-Object Interaction Detection , 2020, ECCV.

[19]  In So Kweon,et al.  Detecting Human-Object Interactions with Action Co-occurrence Priors , 2020, ECCV.

[20]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Quoc V. Le,et al.  Attention Augmented Convolutional Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[22]  Xuming He,et al.  Pose-Aware Multi-Level Feature Network for Human Object Interaction Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[23]  Wenguan Wang,et al.  Cascaded Human-Object Interaction Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Fahad Shahbaz Khan,et al.  Learning Human-Object Interaction Detection Using Interaction Points , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Omer Levy,et al.  Mask-Predict: Parallel Decoding of Conditional Masked Language Models , 2019, EMNLP.

[26]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[27]  Heiga Zen,et al.  Parallel WaveNet: Fast High-Fidelity Speech Synthesis , 2017, ICML.

[28]  Cordelia Schmid,et al.  Detecting Unseen Visual Relations Using Analogies , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[29]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[30]  Andrew Zisserman,et al.  Amplifying Key Cues for Human-Object-Interaction Detection , 2020, ECCV.

[31]  Andrew Y. Ng,et al.  End-to-End People Detection in Crowded Scenes , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Fei Wang,et al.  PPDM: Parallel Point Detection and Matching for Real-Time Human-Object Interaction Detection , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Harold W. Kuhn,et al.  The Hungarian method for the assignment problem , 1955, 50 Years of Integer Programming.

[34]  Derek Hoiem,et al.  No-Frills Human-Object Interaction Detection: Factorization, Layout Encodings, and Training Techniques , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[35]  Jia Deng,et al.  Learning to Detect Human-Object Interactions , 2017, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[36]  Kaiming He,et al.  Detecting and Recognizing Human-Object Interactions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[37]  Jaewoo Kang,et al.  UnionDet: Union-Level Detector Towards Real-Time Human-Object Interaction Detection , 2020, ECCV.

[38]  Dacheng Tao,et al.  Polysemy Deciphering Network for Human-Object Interaction Detection , 2020, ECCV.

[39]  Chen Gao,et al.  iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection , 2018, BMVC.

[40]  B. S. Manjunath,et al.  VSGNet: Spatial Attention Network for Detecting Human Object Interactions Using Graph Convolutions , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).