Exploring region relationships implicitly: Image captioning with visual relationship attention

Abstract Visual attention mechanisms are widely used in image captioning models to dynamically attend to relevant visual regions conditioned on the language context, enabling a trained model to perform fine-grained image understanding and reasoning. However, existing visual attention models focus only on individual visual regions and on aligning the language representation with those individual regions; they do not fully explore the relationships and interactions between regions. Nor do they model the alignment for the words and phrases (e.g., verbs or phrasal verbs) that best describe such relationships and interactions. As a result, current image captioning models can produce inaccurate or inappropriate descriptions. Instead of the region-level attention commonly used by existing visual attention mechanisms, this paper proposes a novel visual relationship attention based on contextualized embeddings of individual regions. It dynamically discovers the visual relationship between multiple regions that is relevant when generating interaction words. This relationship exploration is constrained by spatial relationships and driven by the linguistic context of the language decoder. The proposed visual relationship attention is implemented as a parallel attention mechanism under a learned spatial constraint, so that visual relationship information is mapped more precisely to its semantic description in language. Unlike existing methods for exploring visual relationships, it is trained implicitly in an unsupervised manner, without any explicit visual relationship annotations. By integrating the proposed visual relationship attention with conventional visual region attention, our image captioning model generates high-quality captions. Extensive experiments on the MSCOCO dataset demonstrate that the proposed visual relationship attention effectively boosts captioning performance by capturing the relevant visual relationships needed to generate accurate interaction descriptions.
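
The sketch below illustrates the general idea of attending over region pairs rather than individual regions: pairwise (subject, object) features are scored against the decoder's linguistic context and normalized under a spatial mask. It is a minimal illustrative PyTorch sketch, not the authors' exact formulation; the class name, layer sizes, feature shapes, and the binary spatial-mask form are assumptions made for the example.

```python
# Illustrative sketch of a visual relationship attention step (assumed design,
# not the paper's exact architecture). Region pairs are scored against the
# decoder state and masked by a spatial-plausibility constraint.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualRelationshipAttention(nn.Module):
    def __init__(self, region_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.pair_proj = nn.Linear(2 * region_dim, attn_dim)   # pairwise (relationship) embedding
        self.query_proj = nn.Linear(hidden_dim, attn_dim)      # decoder context as query
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, decoder_state, spatial_mask):
        """
        regions:       (B, N, region_dim)  detected region features
        decoder_state: (B, hidden_dim)     current language-decoder hidden state
        spatial_mask:  (B, N, N)           1 where a region pair is spatially plausible, else 0
        returns: relationship context vector (B, 2 * region_dim) and pair weights (B, N, N)
        """
        B, N, D = regions.shape
        # Build all ordered region pairs (subject, object) -> (B, N, N, 2D)
        subj = regions.unsqueeze(2).expand(B, N, N, D)
        obj = regions.unsqueeze(1).expand(B, N, N, D)
        pairs = torch.cat([subj, obj], dim=-1)

        # Score each pair against the linguistic context of the decoder
        q = self.query_proj(decoder_state).view(B, 1, 1, -1)
        e = self.score(torch.tanh(self.pair_proj(pairs) + q)).squeeze(-1)   # (B, N, N)

        # Apply the spatial constraint before normalizing
        # (assumes at least one unmasked pair per image, otherwise softmax yields NaN)
        e = e.masked_fill(spatial_mask == 0, float('-inf'))
        alpha = F.softmax(e.view(B, -1), dim=-1).view(B, N, N)

        # Weighted sum over all region pairs gives the relationship context vector
        rel_context = (alpha.unsqueeze(-1) * pairs).sum(dim=(1, 2))          # (B, 2D)
        return rel_context, alpha
```

In a full captioning model, a relationship context vector of this kind would be used alongside the usual region-attention context when predicting interaction words, which is the integration the abstract describes.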
