Bridging by Word: Image Grounded Vocabulary Construction for Visual Captioning

Image captioning aims to generate a short description of a given image. Existing research typically adopts a CNN-RNN architecture that treats generation as a sequential decision-making process, with the entire dataset vocabulary used as the decoding space. Such models tend to produce high-frequency n-grams containing irrelevant words. To tackle this problem, we propose constructing an image-grounded vocabulary that constrains and guides caption generation. Specifically, a novel hierarchical structure is proposed to construct the vocabulary, incorporating both visual information and relations among words. For generation, we propose a word-aware RNN cell that incorporates vocabulary information directly into the decoding process. The REINFORCE algorithm is employed to train the generator, with the constrained vocabulary serving as the action space. Experimental results on MS COCO and Flickr30k demonstrate the effectiveness of our framework compared to several state-of-the-art models. A sketch of the constrained decoding and training step follows below.
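To make the idea concrete, here is a minimal PyTorch sketch of the two mechanisms the abstract describes: masking the decoder's output distribution so that only words in an image-grounded vocabulary can be emitted, and a REINFORCE loss that treats that restricted vocabulary as the action space. This is an illustration under our own assumptions, not the authors' implementation; the names (MaskedDecoderStep, vocab_mask, reinforce_loss), the plain LSTM cell standing in for the paper's word-aware cell, and the reward placeholder are all hypothetical.

```python
# Hypothetical sketch of vocabulary-constrained decoding + REINFORCE training.
# Not the paper's released code; names and internals are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedDecoderStep(nn.Module):
    """One decoding step whose output space is limited to an image-grounded vocabulary."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # A standard LSTM cell stands in for the paper's word-aware RNN cell.
        self.cell = nn.LSTMCell(embed_dim, hidden_dim)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, state, vocab_mask):
        # prev_word: (B,) previous token ids
        # vocab_mask: (B, V) with 1 for words in this image's grounded
        #             vocabulary and 0 otherwise.
        h, c = self.cell(self.embed(prev_word), state)
        logits = self.proj(h)
        # Restrict the decoding space: words outside the constructed
        # vocabulary receive -inf logits and hence zero probability.
        logits = logits.masked_fill(vocab_mask == 0, float("-inf"))
        return F.log_softmax(logits, dim=-1), (h, c)

def reinforce_loss(log_probs, sampled_words, rewards):
    # REINFORCE over the constrained action space: maximize the expected
    # sentence-level reward (e.g., CIDEr) of sampled captions.
    # log_probs: (B, T, V); sampled_words: (B, T); rewards: (B,)
    lp = log_probs.gather(2, sampled_words.unsqueeze(-1)).squeeze(-1)  # (B, T)
    return -(rewards.unsqueeze(1) * lp).sum(dim=1).mean()
```

Because out-of-vocabulary logits are set to -inf before the softmax, both greedy decoding and sampling during REINFORCE training are automatically confined to the image-grounded vocabulary, which is the constraint-and-guidance effect the abstract refers to.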
