A Neural Compositional Paradigm for Image Captioning

Mainstream captioning models typically generate captions in a sequential manner, which leads to issues such as the introduction of irrelevant semantics, limited diversity among the generated captions, and weak generalization. In this paper, we present an alternative paradigm for image captioning, which factorizes the captioning procedure into two stages: (1) extracting an explicit semantic representation from the given image; and (2) constructing the caption through a recursive compositional procedure in a bottom-up manner. Compared to conventional approaches, our paradigm better preserves semantic content through an explicit factorization of semantics and syntax. With the compositional generation procedure, caption construction follows a recursive structure that naturally fits the properties of human language. Moreover, the proposed compositional procedure requires less training data, generalizes better, and yields more diverse captions.
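To make the two-stage procedure concrete, below is a minimal sketch in PyTorch of what such a pipeline could look like. The module names (PhraseExtractor, ConnectingModule) and their interfaces are illustrative assumptions, not the paper's released implementation: stage 1 maps image features to a set of explicit phrase-level semantic representations, and stage 2 recursively merges them bottom-up into a single caption representation.

```python
# A minimal sketch of the two-stage compositional paradigm described above.
# All module names and interfaces here are hypothetical, chosen only to
# illustrate the factorization into semantics (stage 1) and syntax (stage 2).

from typing import List
import torch
import torch.nn as nn


class PhraseExtractor(nn.Module):
    """Stage 1 (hypothetical): predict a fixed-size set of phrase embeddings
    that explicitly represent the semantic content of the image."""

    def __init__(self, feat_dim: int, phrase_dim: int, max_phrases: int = 8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, phrase_dim * max_phrases)
        self.max_phrases = max_phrases
        self.phrase_dim = phrase_dim

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
        # image_feat: (batch, feat_dim) -> (batch, max_phrases, phrase_dim)
        out = self.proj(image_feat)
        return out.view(-1, self.max_phrases, self.phrase_dim)


class ConnectingModule(nn.Module):
    """Stage 2 (hypothetical): merge two partial phrase representations into
    a longer one; applied recursively to build the caption bottom-up."""

    def __init__(self, phrase_dim: int):
        super().__init__()
        self.merge = nn.Sequential(
            nn.Linear(2 * phrase_dim, phrase_dim),
            nn.ReLU(),
            nn.Linear(phrase_dim, phrase_dim),
        )

    def forward(self, left: torch.Tensor, right: torch.Tensor) -> torch.Tensor:
        return self.merge(torch.cat([left, right], dim=-1))


def compose_caption(phrases: List[torch.Tensor],
                    connector: ConnectingModule) -> torch.Tensor:
    """Greedy bottom-up composition: repeatedly merge adjacent phrase
    representations until a single caption-level representation remains."""
    nodes = list(phrases)
    while len(nodes) > 1:
        merged = [connector(nodes[i], nodes[i + 1])
                  for i in range(0, len(nodes) - 1, 2)]
        if len(nodes) % 2 == 1:  # carry over an unpaired phrase to the next level
            merged.append(nodes[-1])
        nodes = merged
    return nodes[0]
```

In this sketch the recursion mirrors the constituency structure of a sentence: short phrases are combined into longer ones until the full caption is assembled, which is what allows the procedure to be trained from phrase-level supervision rather than full left-to-right sequences.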
