A Hierarchical Approach for Generating Descriptive Image Paragraphs

Recent progress on image captioning has made it possible to generate novel sentences describing images in natural language, but compressing an image into a single sentence can describe visual content in only coarse detail. While one new captioning approach, dense captioning, can potentially describe images in finer levels of detail by captioning many regions within an image, it in turn is unable to produce a coherent story for an image. In this paper we overcome these limitations by generating entire paragraphs for describing images, which can tell detailed, unified stories. We develop a model that decomposes both images and paragraphs into their constituent parts, detecting semantic regions in images and using a hierarchical recurrent neural network to reason about language. Linguistic analysis confirms the complexity of the paragraph generation task, and thorough experiments on a new dataset of image and paragraph pairs demonstrate the effectiveness of our approach.
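The hierarchical decomposition described above can be illustrated with a minimal sketch. It is not the paper's implementation: the mean pooling of region features, the plain tanh recurrent cell (standing in for whatever recurrent unit the model uses), the greedy word choice, the stop threshold, and every name and dimension below are assumptions made for illustration. The weights are random and untrained, so the output shows only the control flow: a sentence-level recurrence emits one topic vector per sentence and decides when to stop, and a word-level recurrence expands each topic into words.

```python
import math
import random


def make_matrix(rows, cols, rng):
    # Random, untrained weights; a real model would learn these.
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_step(w_in, w_rec, x, h):
    # Plain tanh recurrent cell (an assumed stand-in for the model's recurrent unit).
    a = matvec(w_in, x)
    b = matvec(w_rec, h)
    return [math.tanh(p + q) for p, q in zip(a, b)]


def mean_pool(region_features):
    # Collapse per-region feature vectors into one image vector (assumed pooling).
    n = len(region_features)
    return [sum(column) / n for column in zip(*region_features)]


def generate_paragraph(region_features, vocab, hidden=8,
                       max_sentences=3, max_words=5, seed=0):
    rng = random.Random(seed)
    feat_dim = len(region_features[0])
    sent_in = make_matrix(hidden, feat_dim, rng)
    sent_rec = make_matrix(hidden, hidden, rng)
    word_in = make_matrix(hidden, hidden, rng)
    word_rec = make_matrix(hidden, hidden, rng)
    word_out = make_matrix(len(vocab), hidden, rng)
    stop_out = make_matrix(1, hidden, rng)

    image_vec = mean_pool(region_features)
    h_sent = [0.0] * hidden
    paragraph = []
    for _ in range(max_sentences):
        # Sentence level: one step per sentence; the hidden state is the topic.
        h_sent = rnn_step(sent_in, sent_rec, image_vec, h_sent)
        topic = h_sent

        # Word level: expand the topic into a fixed number of words (greedy argmax).
        h_word = [0.0] * hidden
        words = []
        for _ in range(max_words):
            h_word = rnn_step(word_in, word_rec, topic, h_word)
            scores = matvec(word_out, h_word)
            words.append(vocab[scores.index(max(scores))])
        paragraph.append(words)

        # Stop decision from the sentence state (hypothetical 0.5 threshold).
        stop_prob = 1.0 / (1.0 + math.exp(-matvec(stop_out, h_sent)[0]))
        if stop_prob > 0.5:
            break
    return paragraph


if __name__ == "__main__":
    regions = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # two made-up region vectors
    vocab = ["a", "dog", "runs", "."]
    for sentence in generate_paragraph(regions, vocab):
        print(" ".join(sentence))
```

Splitting generation into two levels means the sentence-level state only has to track what each sentence is about, while the word-level state is reset for every sentence and handles the wording.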
