High-Order Interaction Learning for Image Captioning

Image captioning aims to understand various semantic concepts (e.g., objects and relationships) in an image and integrate them into a sentence-level description. Hence, it is necessary to learn the interactions among these concepts. If we define the context of an interaction as a subject-predicate-object triplet, most current methods focus on only a single triplet, i.e., the first-order interaction, when generating sentences. Intuitively, humans can perceive high-order interactions among concepts across two or more triplets when describing an image. For example, given the triplets man-cutting-sandwich and man-with-knife, it is natural to integrate them and predict the sentence "man cutting sandwich with knife". This prediction depends on the high-order interaction between cutting and knife across different triplets. Therefore, exploiting high-order interactions is expected to benefit image captioning by supporting such relational reasoning. In this paper, we introduce a novel high-order interaction learning method over detected objects and relationships for image captioning, built on the encoder-decoder framework. We first extract a set of object and relationship features from an image. During the encoding stage, an interactive refining network learns high-order representations by modeling intra- and inter-object feature interactions in a self-attention fashion. During the decoding stage, an interactive fusion network integrates object and relationship information by strengthening their high-order interactions conditioned on the language context for sentence generation. In this way, we learn object-relationship dependencies at both stages, which provides abundant cues for visual understanding and caption generation. Extensive experiments show that the proposed method achieves competitive performance against state-of-the-art methods on the MSCOCO dataset, and ablation studies further validate its effectiveness.
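
To make the encoding-stage idea concrete, the following is a minimal PyTorch sketch of the two attention steps the abstract describes: intra-object self-attention followed by object-to-relationship attention, which couples concepts drawn from different subject-predicate-object triplets. The class name, dimensions, and layer choices are illustrative assumptions, not the authors' released implementation; the decoding-stage fusion network would analogously attend over object and relationship features conditioned on the language state.

```python
import torch
import torch.nn as nn

class InteractiveRefining(nn.Module):
    """Hypothetical sketch of the encoder-side refining step: self-attention
    over object and relationship features to model intra- and inter-object
    (high-order) interactions. All names and sizes are assumptions."""

    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        # Intra-object interaction: object features attend to each other.
        self.obj_self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Inter-object interaction: objects attend to relationship (triplet)
        # features, linking concepts across different triplets.
        self.obj_rel_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, obj_feats, rel_feats):
        # obj_feats: (B, N_obj, d_model) region features from a detector
        # rel_feats: (B, N_rel, d_model) relationship features
        x, _ = self.obj_self_attn(obj_feats, obj_feats, obj_feats)
        obj_feats = self.norm1(obj_feats + x)   # residual + layer norm
        x, _ = self.obj_rel_attn(obj_feats, rel_feats, rel_feats)
        return self.norm2(obj_feats + x)        # refined high-order features

# Example: refine 36 detected object features with 20 relationship features.
refine = InteractiveRefining()
objs = torch.randn(2, 36, 512)
rels = torch.randn(2, 20, 512)
refined = refine(objs, rels)                    # shape: (2, 36, 512)
```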
