Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition

Automatically generating descriptions for images has attracted considerable interest recently. Most existing language-model-based approaches learn to generate an image description word by word in its original word order. For humans, however, it is more natural to first locate the objects and their relationships, and then elaborate on each object with its notable attributes. We present a coarse-to-fine method that decomposes the original image description into a skeleton sentence and its attributes, and generates the skeleton sentence and the attribute phrases separately. Through this decomposition, our method generates more accurate and novel descriptions than the previous state of the art. Experimental results on MS-COCO and the larger-scale Stock3M dataset show that our algorithm yields consistent improvements across evaluation metrics, especially on SPICE, which correlates much more strongly with human ratings than conventional metrics. Furthermore, because the skeleton and the attributes are controlled separately, our algorithm can generate descriptions of varied length, enabling image description generation that better accommodates user preferences.
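
To make the decomposition concrete, below is a minimal sketch of how a caption might be split into a skeleton sentence and per-object attributes using an off-the-shelf dependency parser (spaCy here). The abstract does not specify the paper's extraction rules, so the chosen modifier relations (`amod`, `compound`, `nummod`) and the `decompose` helper are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: approximates skeleton-attribute decomposition
# with a dependency parse. Assumes spaCy is installed ("pip install spacy")
# along with its small English model ("python -m spacy download en_core_web_sm").
import spacy

nlp = spacy.load("en_core_web_sm")

# Modifier relations peeled off into attribute phrases (an assumption;
# the paper's actual rules may differ).
ATTRIBUTE_DEPS = {"amod", "compound", "nummod"}

def decompose(caption: str):
    """Split a caption into a skeleton sentence and per-noun attribute lists."""
    doc = nlp(caption)
    skeleton_tokens = []
    attributes = {}  # head noun -> list of modifier words
    for token in doc:
        if token.dep_ in ATTRIBUTE_DEPS and token.head.pos_ in {"NOUN", "PROPN"}:
            # Modifier of a noun: record it as an attribute of that noun.
            attributes.setdefault(token.head.text, []).append(token.text)
        else:
            # Everything else stays in the skeleton sentence.
            skeleton_tokens.append(token.text)
    return " ".join(skeleton_tokens), attributes

skeleton, attrs = decompose("a small white dog sitting on a wooden bench")
print(skeleton)  # e.g. "a dog sitting on a bench"
print(attrs)     # e.g. {"dog": ["small", "white"], "bench": ["wooden"]}
```

In the coarse-to-fine scheme described above, the skeleton and the attribute phrases would then be generated by separate stages, and caption length can be varied by controlling how many attributes are attached to each skeleton word.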
