GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement

Grounded Situation Recognition (GSR) aims to generate structured semantic summaries of images for "human-like'' event understanding. Specifically, GSR task not only detects the salient activity verb (e.g. buying), but also predicts all corresponding semantic roles (e.g. agent and goods). Inspired by object detection and image captioning tasks, existing methods typically employ a two-stage framework: 1) detect the activity verb, and then 2) predict semantic roles based on the detected verb. Obviously, this illogical framework constitutes a huge obstacle to semantic understanding. First, pre-detecting verbs solely without semantic roles inevitably fails to distinguish many similar daily activities (e.g., offering and giving, buying and selling). Second, predicting semantic roles in a closed auto-regressive manner can hardly exploit the semantic relations among the verb and roles. To this end, in this paper we propose a novel two-stage framework that focuses on utilizing such bidirectional relations within verbs and roles. In the first stage, instead of pre-detecting the verb, we postpone the detection step and assume a pseudo label, where an intermediate representation for each corresponding semantic role is learned from images. In the second stage, we exploit transformer layers to unearth the potential semantic relations within both verbs and semantic roles. With the help of a set of support images, an alternate learning scheme is designed to simultaneously optimize the results: update the verb using nouns corresponding to the image, and update nouns using verbs from support images. Extensive experimental results on challenging SWiG benchmarks show that our renovated framework outperforms other state-of-the-art methods under various metrics.

[1]  Rita Cucchiara,et al.  From Show to Tell: A Survey on Deep Learning-Based Image Captioning , 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Suha Kwak,et al.  Collaborative Transformers for Grounded Situation Recognition , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Xiangnan He,et al.  Group Contextualization for Video Recognition , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Angela Yao,et al.  Video as Conditional Graph Hierarchy for Multi-Granular Question Answering , 2021, AAAI.

[5]  Tat-Seng Chua,et al.  Rethinking the Two-Stage Framework for Grounded Situation Recognition , 2021, AAAI.

[6]  Suha Kwak,et al.  Grounded Situation Recognition with Transformers , 2021, BMVC.

[7]  Chong-Wah Ngo,et al.  Token Shift Transformer for Video Classification , 2021, ACM Multimedia.

[8]  Bodo Rosenhahn,et al.  Spatial-Temporal Transformer for Dynamic Scene Graph Generation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[9]  Olga Russakovsky,et al.  Understanding and Evaluating Racial Biases in Image Captioning , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[10]  Lu Yuan,et al.  Dynamic Head: Unifying Object Detection Heads with Attentions , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Andrea Bacciu,et al.  Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources , 2021, NAACL.

[12]  Qi Wu,et al.  Towards Accurate Text-based Image Captioning with Content Diversity Exploration , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Arka Sadhu,et al.  Visual Semantic Role Labeling for Video Understanding , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Cho-Jui Hsieh,et al.  Robust and Accurate Object Detection via Adversarial Learning , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Shuohang Wang,et al.  LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval , 2021, NAACL.

[16]  Matthieu Cord,et al.  Training data-efficient image transformers & distillation through attention , 2020, ICML.

[17]  Yongjian Wu,et al.  Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network , 2020, AAAI.

[18]  Xiao Wu,et al.  DB-LSTM: Densely-connected Bi-directional LSTM for human action recognition , 2020, Neurocomputing.

[19]  S. Gelly,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.

[20]  Bin Li,et al.  Deformable DETR: Deformable Transformers for End-to-End Object Detection , 2020, ICLR.

[21]  Stephen Lin,et al.  Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[22]  Chun Yuan,et al.  HOSE-Net: Higher Order Structure Embedded Network for Scene Graph Generation , 2020, ACM Multimedia.

[23]  Ngai-Man Cheung,et al.  Attention-Based Context Aware Reasoning for Situation Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Nicolas Usunier,et al.  End-to-End Object Detection with Transformers , 2020, ECCV.

[25]  Bolei Zhou,et al.  Temporal Pyramid Network for Action Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Ali Farhadi,et al.  Grounded Situation Recognition , 2020, ECCV.

[27]  Shiliang Pu,et al.  Counterfactual Samples Synthesizing for Robust Visual Question Answering , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Jiashi Feng,et al.  PPDM: Parallel Point Detection and Matching for Real-Time Human-Object Interaction Detection , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Marcella Cornia,et al.  Meshed-Memory Transformer for Image Captioning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Quoc V. Le,et al.  EfficientDet: Scalable and Efficient Object Detection , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Leonid Sigal,et al.  Mixture-Kernel Graph Attention Network for Situation Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[32]  Ilya Sutskever,et al.  Generating Long Sequences with Sparse Transformers , 2019, ArXiv.

[33]  Yang Wang,et al.  Cross-Modal Self-Attention Network for Referring Image Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Yu Cheng,et al.  Relation-Aware Graph Attention Network for Visual Question Answering , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[35]  Silvio Savarese,et al.  Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Matthieu Cord,et al.  MUREL: Multimodal Relational Reasoning for Visual Question Answering , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Long Chen,et al.  Counterfactual Critic Multi-Agent Training for Scene Graph Generation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[38]  Xiao Wu,et al.  Learning to Transfer: Generalizable Attribute Learning with Multitask Neural Model Search , 2018, ACM Multimedia.

[39]  Stefan Lee,et al.  Graph R-CNN for Scene Graph Generation , 2018, ECCV.

[40]  Andrew McCallum,et al.  Linguistically-Informed Self-Attention for Semantic Role Labeling , 2018, EMNLP.

[41]  Xi Li,et al.  GNAS: A Greedy Neural Architecture Search Method for Multi-Attribute Learning , 2018, ACM Multimedia.

[42]  Dustin Tran,et al.  Image Transformer , 2018, ICML.

[43]  Sanja Fidler,et al.  MovieGraphs: Towards Understanding Human-Centric Situations from Videos , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[44]  Yidong Chen,et al.  Deep Semantic Role Labeling with Self-Attention , 2017, AAAI.

[45]  Yann LeCun,et al.  A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[46]  Yejin Choi,et al.  Neural Motifs: Scene Graph Parsing with Global Context , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[47]  Lei Zhang,et al.  Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[48]  Xiao Wu,et al.  Personalized clothing recommendation combining user social circle and fashion style consistency , 2017, Multimedia Tools and Applications.

[49]  Sanja Fidler,et al.  Situation Recognition with Graph Neural Networks , 2018 .

[50]  Xiaogang Wang,et al.  Scene Graph Generation from Objects, Phrases and Region Captions , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[51]  Yang Liu,et al.  Video2Shop: Exact Matching Clothes in Videos to Online Shopping Images , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[53]  Chong-Wah Ngo,et al.  On the Selection of Anchors and Targets for Video Hyperlinking , 2017, ICMR.

[54]  Yang Liu,et al.  Video eCommerce++: Toward Large Scale Online Video Advertising , 2017, IEEE Transactions on Multimedia.

[55]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Samuel S. Schoenholz,et al.  Neural Message Passing for Quantum Chemistry , 2017, ICML.

[57]  Svetlana Lazebnik,et al.  Recurrent Models for Situation Recognition , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[58]  Danfei Xu,et al.  Scene Graph Generation by Iterative Message Passing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Bohyung Han,et al.  Large-Scale Image Retrieval with Attentive Deep Local Features , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[60]  Huimin Ma,et al.  Single Image Action Recognition Using Semantic Body Part Actions , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[61]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[62]  Ali Farhadi,et al.  Commonly Uncommon: Semantic Sparsity in Situation Recognition , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[63]  Vaibhava Goel,et al.  Self-Critical Sequence Training for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[64]  Tat-Seng Chua,et al.  SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[65]  Max Welling,et al.  Semi-Supervised Classification with Graph Convolutional Networks , 2016, ICLR.

[66]  Qing Li,et al.  VIREO @ TRECVID 2017: Video-to-Text, Ad-hoc Video Search, and Video hyperlinking , 2017, TRECVID.

[67]  Yang Liu,et al.  Video eCommerce: Towards Online Video Advertising , 2016, ACM Multimedia.

[68]  Tao Chen,et al.  Context-aware Image Tweet Modelling and Recommendation , 2016, ACM Multimedia.

[69]  Ali Farhadi,et al.  Situation Recognition: Visual Semantic Role Labeling for Image Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[70]  Jiebo Luo,et al.  Image Captioning with Semantic Attention , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[71]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[72]  Richard S. Zemel,et al.  Gated Graph Sequence Neural Networks , 2015, ICLR.

[73]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[74]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[75]  Cícero Nogueira dos Santos,et al.  Semantic Role Labeling , 2012 .

[76]  Christopher R. Johnson,et al.  Background to Framenet , 2003 .

[77]  Martha Palmer,et al.  From TreeBank to PropBank , 2002, LREC.

[78]  Collin F. Baker,et al.  Frame semantics for text understanding , 2001 .

[79]  John B. Lowe,et al.  The Berkeley FrameNet Project , 1998, ACL.