Grounded Situation Recognition with Transformers

Grounded Situation Recognition (GSR) is the task of not only classifying a salient action (verb) but also predicting the entities (nouns) associated with its semantic roles and their locations in the given image. Inspired by the remarkable success of Transformers in vision tasks, we propose a GSR model based on a Transformer encoder-decoder architecture. The attention mechanism of our model enables accurate verb classification by effectively capturing high-level semantic features of an image, and allows the model to flexibly handle the complicated, image-dependent relations between entities for improved noun classification and localization. Our model is the first Transformer architecture for GSR and achieves the state of the art in every evaluation metric on the SWiG benchmark. Our code is available at https://github.com/jhcho99/gsrtr.

Figure 1: Predictions of our model on the SWiG dataset.
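To illustrate the attention mechanism the abstract refers to, here is a minimal pure-Python sketch of scaled dot-product attention, the core operation of a Transformer encoder-decoder. This is a generic illustration, not the authors' implementation; the function name and the use of Python lists (rather than tensors) are choices made for this sketch.

```python
import math

def scaled_dot_product_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length float vectors.

    Each query attends to all keys; the resulting softmax weights form a
    weighted sum of the value vectors. In a GSR-style model, such attention
    lets each role query aggregate image features relevant to its entity.
    """
    d = len(keys[0])  # key dimensionality, used for scaling
    outputs = []
    for q in queries:
        # Attention scores: dot products between the query and every key,
        # scaled by sqrt(d) to keep softmax gradients well-behaved.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        # Numerically stable softmax over the keys.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Output: attention-weighted sum of the value vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs
```

With one-hot values, the output directly exposes the attention weights: a query aligned with the first key yields an output whose first component dominates.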
