Improved Human-Object Interaction Detection Through On-the-Fly Stacked Generalization

Human-object interaction (HOI) detection, which identifies the relationships between humans and objects in an image, is an important research area, but the performance of current HOI detectors remains unsatisfactory. One of the main problems is that CNN-based HOI detection algorithms, trained on a limited number of available examples, fail to predict correct outputs for unseen test data. Herein, we propose a novel framework for HOI detection called the on-the-fly stacked generalization deep neural network (OSGNet). OSGNet consists of three main components: (1) feature extraction modules, (2) HOI relationship detection networks, and (3) a meta-learner that combines the outputs of the sub-models. Components (1) and (2) together are regarded as sub-models. Any task-based feature extraction module, such as a classification or human pose estimation module, can be used as a sub-model. To achieve on-the-fly stacked generalization, the sub-models and the meta-learner are trained simultaneously. The sub-models are trained to provide complementary information, and the meta-learner improves generalization to unseen test data. Extensive experiments demonstrate that the proposed method achieves state-of-the-art accuracy, particularly for rare classes.
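
The idea of training sub-models and a meta-learner simultaneously can be illustrated with a minimal toy sketch. It is not the OSGNet architecture: two one-feature logistic regressors stand in for the sub-models, a third logistic regressor over their logits stands in for the meta-learner, and the joint loss (the sum of the meta-learner loss and each sub-model loss) is an assumption made for illustration. All names (`make_data`, `train_jointly`, `meta_accuracy`, and so on) are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce(p, y):
    # Binary cross-entropy, clipped to avoid log(0).
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def make_data(n, rng):
    # Toy task: the label depends on both features, so neither
    # single-feature sub-model can solve it alone.
    data = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        data.append((x1, x2, 1.0 if x1 + x2 > 0 else 0.0))
    return data


def init_params():
    return {"wa": 0.1, "ba": 0.0, "wb": 0.1, "bb": 0.0,
            "va": 0.5, "vb": 0.5, "c": 0.0}


def forward(p, x1, x2):
    za = p["wa"] * x1 + p["ba"]               # sub-model A logit (sees x1 only)
    zb = p["wb"] * x2 + p["bb"]               # sub-model B logit (sees x2 only)
    m = p["va"] * za + p["vb"] * zb + p["c"]  # meta-learner logit
    return za, zb, m


def joint_loss(p, data):
    # Assumed joint objective: meta-learner loss plus each sub-model loss.
    total = 0.0
    for x1, x2, y in data:
        za, zb, m = forward(p, x1, x2)
        total += bce(sigmoid(m), y) + bce(sigmoid(za), y) + bce(sigmoid(zb), y)
    return total / len(data)


def train_jointly(data, epochs=50, lr=0.1, seed=0):
    # One SGD step updates sub-models and meta-learner together, instead of
    # fitting the meta-learner on frozen sub-model outputs afterwards.
    rng = random.Random(seed)
    p = init_params()
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for x1, x2, y in data:
            za, zb, m = forward(p, x1, x2)
            dm = sigmoid(m) - y                       # dLoss/dm
            dza = dm * p["va"] + (sigmoid(za) - y)    # meta path + own loss
            dzb = dm * p["vb"] + (sigmoid(zb) - y)
            grads = {"va": dm * za, "vb": dm * zb, "c": dm,
                     "wa": dza * x1, "ba": dza,
                     "wb": dzb * x2, "bb": dzb}
            for k in p:
                p[k] -= lr * grads[k]
    return p


def meta_accuracy(p, data):
    # Fraction of examples the meta-learner classifies correctly.
    hits = sum((forward(p, x1, x2)[2] > 0) == (y == 1.0) for x1, x2, y in data)
    return hits / len(data)


if __name__ == "__main__":
    rng = random.Random(0)
    train, test = make_data(200, rng), make_data(200, rng)
    params = train_jointly(train)
    print("joint loss:", joint_loss(params, train))
    print("meta-learner test accuracy:", meta_accuracy(params, test))
```

In classical stacked generalization the meta-learner is fit after the sub-models are trained; here each sub-model receives gradients both from its own loss and from the meta-learner's loss in the same update, which is the sense of "on-the-fly" that this sketch is meant to convey.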
