Diverse and Coherent Paragraph Generation from Images

Paragraph generation from images, which has gained popularity recently, is an important task for video summarization, editing, and support of the disabled. Traditional image captioning methods fall short on this front, since they are not designed to generate long, informative descriptions. Moreover, the vanilla approach of simply concatenating multiple short sentences, possibly synthesized by a classical image captioning system, does not capture the intricacies of paragraphs: coherent sentences, globally consistent structure, and diversity. To address these challenges, we propose to augment paragraph generation techniques with "coherence vectors," "global topic vectors," and modeling of the inherent ambiguity of associating paragraphs with images via a variational auto-encoder formulation. We demonstrate the effectiveness of the developed approach on two datasets, outperforming existing state-of-the-art techniques on both.
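
The abstract describes the approach only at a high level. As a rough illustration of the kind of architecture it implies, the sketch below combines a VAE over image features (the latent sample acting as a global topic vector), a sentence-level RNN that rolls that topic forward across sentences, and a simple coherence signal from the previous sentence state. All module names, dimensions, and the exact way the coherence vector is injected are assumptions made for illustration, not the paper's precise formulation.

```python
import torch
import torch.nn as nn

class ParagraphVAE(nn.Module):
    """Illustrative VAE-style paragraph generator: an image feature is encoded
    into a latent 'global topic vector'; a sentence-level RNN unrolls one topic
    per sentence, and a 'coherence vector' (here, a transform of the previous
    sentence state) conditions the next sentence's word-level decoder."""

    def __init__(self, img_dim=2048, hid=512, latent=256, vocab=10000, emb=300):
        super().__init__()
        # Encoder: image feature -> Gaussian posterior over the latent topic.
        self.to_mu = nn.Linear(img_dim, latent)
        self.to_logvar = nn.Linear(img_dim, latent)
        # Sentence-level RNN produces one hidden state per sentence.
        self.sent_rnn = nn.GRUCell(latent, hid)
        # Coherence vector: previous sentence state mixed into the next topic.
        self.coherence = nn.Linear(hid, hid)
        # Word-level decoder generates each sentence from its topic vector.
        self.embed = nn.Embedding(vocab, emb)
        self.word_rnn = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, img_feat, sent_tokens):
        # img_feat: (B, img_dim); sent_tokens: (B, S, T) word indices.
        mu, logvar = self.to_mu(img_feat), self.to_logvar(img_feat)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        B, S, T = sent_tokens.shape
        h_sent = img_feat.new_zeros(B, self.sent_rnn.hidden_size)
        logits = []
        for s in range(S):
            h_sent = self.sent_rnn(z, h_sent)           # roll the topic forward
            topic = h_sent + self.coherence(h_sent)     # add coherence signal
            emb = self.embed(sent_tokens[:, s])         # (B, T, emb)
            h0 = topic.unsqueeze(0)                     # init word RNN with topic
            out, _ = self.word_rnn(emb, h0)
            logits.append(self.out(out))                # per-word vocabulary scores
        # Standard VAE KL term against a unit Gaussian prior.
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
        return torch.stack(logits, dim=1), kl           # (B, S, T, vocab), scalar
```

At test time, drawing several samples of z for the same image and decoding each yields a set of distinct paragraphs, which is how a VAE formulation of this kind captures the ambiguity (and hence diversity) of paragraph descriptions that the abstract refers to.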
