Learning Deep Contextual Attention Network for Narrative Photo Stream Captioning

While image captioning has been extensively studied, the problem of generating narrative descriptions for photo streams remains underexplored. Photo stream captioning is more challenging due to the large visual variance and complicated object context within an ordered collection of photos, as well as the need for sentence-to-sentence coherence across the story. To deal with these challenges, we propose a novel deep contextual attention network (CAN) that narratively describes photo streams by jointly exploring the rich context among attended regions and the coherence among sentences. The proposed CAN follows an encoder-decoder framework: the encoder models visual context via region-level bilinear similarity and selectively focuses on attention areas with salient context, while a novel hierarchical gated recurrent unit (h-GRU) acts as the decoder to preserve semantic coherence among the generated sentences. Because CAN is able to exploit visual attention and context across the photo stream, the generated story is more semantically coherent than a simple concatenation of isolated per-image captions. Experiments on the SIND dataset show that CAN outperforms state-of-the-art methods by 3.1%, 8.9%, and 9.1% in terms of BLEU, METEOR, and CIDEr, respectively.
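
The abstract describes the two components only at a high level; the sketch below illustrates one plausible reading of them, assuming PyTorch: a region-level bilinear attention that scores each region feature against a story-level context vector, and a two-level (hierarchical) GRU decoder in which a sentence-level GRU threads state across photos while a word-level GRU generates each caption from that state. All class names, dimensions, and wiring here are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class BilinearRegionAttention(nn.Module):
    """Scores each region feature against a context vector through a learned
    bilinear form, then returns an attention-weighted region summary.
    (Illustrative sketch; dimensions and wiring are assumptions.)"""

    def __init__(self, region_dim, ctx_dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(region_dim, ctx_dim) * 0.01)

    def forward(self, regions, context):
        # regions: (batch, num_regions, region_dim); context: (batch, ctx_dim)
        scores = torch.einsum('brd,dc,bc->br', regions, self.W, context)
        alpha = torch.softmax(scores, dim=1)                      # attention weights
        summary = torch.einsum('br,brd->bd', alpha, regions)      # weighted region summary
        return summary, alpha


class HierarchicalGRUDecoder(nn.Module):
    """Two-level decoder: a sentence-level GRU carries state across photos,
    and a word-level GRU generates each sentence from that state."""

    def __init__(self, feat_dim, hidden_dim, vocab_size, embed_dim=256):
        super().__init__()
        self.sent_gru = nn.GRUCell(feat_dim, hidden_dim)
        self.word_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, photo_feats, captions):
        # photo_feats: (batch, num_photos, feat_dim)
        # captions:    (batch, num_photos, max_len) token ids (teacher forcing)
        batch, num_photos, _ = photo_feats.shape
        h_sent = photo_feats.new_zeros(batch, self.sent_gru.hidden_size)
        logits = []
        for t in range(num_photos):
            h_sent = self.sent_gru(photo_feats[:, t], h_sent)    # story-level state
            words = self.embed(captions[:, t])                   # (batch, max_len, embed_dim)
            out, _ = self.word_gru(words, h_sent.unsqueeze(0))   # seeded with sentence state
            logits.append(self.out(out))
        return torch.stack(logits, dim=1)  # (batch, num_photos, max_len, vocab_size)
```

In such a design, threading the sentence-level hidden state across photos is what would carry cross-sentence coherence, while the bilinear scores let regions with salient context dominate each photo's summary before it reaches the decoder.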
