Relational graph neural network for situation recognition

Abstract Situation recognition has recently gained attention as a challenging image-understanding task: it requires jointly predicting the main activity (verb) and its associated objects (noun entities) in a structured, detailed way. Several methods have been proposed for this task, but they usually fail to model the relationships between the activity and the objects effectively. In this paper, we propose a Relational Graph Neural Network (RGNN) for situation recognition, which builds a neural graph over the activity and the objects and models the triplet relationships between the activity and pairs of objects through message passing between graph nodes. We also propose a two-stage training strategy to optimize the model. Progressive supervised learning is first used to obtain initial predictions for the activity and the objects. These predictions are then refined with a policy-gradient method that directly optimizes the non-differentiable value-all metric. To verify the effectiveness of our method, we perform extensive experiments on the imSitu dataset, which is currently the only available dataset for situation recognition. Experimental results show that our approach outperforms state-of-the-art methods on the verb and value metrics and better captures the relationships between the activity and the objects.
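
The second training stage can be illustrated with a small sketch. The code below is not the authors' implementation; it only shows how a non-differentiable, all-or-nothing reward can drive a policy-gradient (REINFORCE-style) update. The function names `value_all`, `softmax`, and `reinforce_grad` are hypothetical, the reward assumes that each predicted role noun counts as correct if it matches any annotator, and a single softmax over candidate nouns stands in for the full RGNN output.

```python
import math


def value_all(pred_nouns, annotations):
    """Return 1.0 if every predicted role noun matches at least one annotator.

    pred_nouns: dict mapping role -> predicted noun.
    annotations: list of dicts (one per annotator) mapping role -> noun.
    Assumption: roles are checked independently against the annotators.
    """
    for role, noun in pred_nouns.items():
        if not any(ann.get(role) == noun for ann in annotations):
            return 0.0
    return 1.0


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_grad(logits, sampled_index, reward, baseline=0.0):
    """Gradient of -(reward - baseline) * log p(sampled_index) w.r.t. logits.

    For a softmax policy, d(-log p_a)/d(logit_k) = p_k - 1[k == a], so the
    reward only scales the gradient and never needs to be differentiated.
    """
    probs = softmax(logits)
    advantage = reward - baseline
    return [
        advantage * (p - (1.0 if k == sampled_index else 0.0))
        for k, p in enumerate(probs)
    ]


# Hypothetical example: two roles, two annotators.
annotations = [
    {"agent": "man", "tool": "knife"},
    {"agent": "person", "tool": "blade"},
]
reward = value_all({"agent": "person", "tool": "knife"}, annotations)
grad = reinforce_grad([0.0, 0.0], sampled_index=0, reward=reward)
```

A gradient-descent step along `grad` raises the probability of the sampled noun when the reward exceeds the baseline and leaves the logits unchanged when the advantage is zero, which is why a supervised first stage that already yields some correct predictions is a natural starting point.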
