Towards Coherent Visual Storytelling with Ordered Image Attention

We address the problem of visual storytelling, i.e., generating a story for a given sequence of images. While each sentence of the story should describe a corresponding image, a coherent story also needs to be consistent and relate to both future and past images. To achieve this we develop ordered image attention (OIA). OIA models interactions between the sentence-corresponding image and important regions in other images of the sequence. To highlight the important objects, a message-passing-like algorithm collects representations of those objects in an order-aware manner. To generate the story’s sentences, we then highlight important image attention vectors with an Image-Sentence Attention (ISA). Further, to alleviate common linguistic mistakes like repetitiveness, we introduce an adaptive prior. The obtained results improve the METEOR score on the VIST dataset by 1%. In addition, an extensive human study verifies coherency improvements and shows that OIA and ISA generated stories are more focused, shareable, and imagegrounded.

[1]  Yejin Choi,et al.  The Curious Case of Neural Text Degeneration , 2019, ICLR.

[2]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[3]  Mauro Cettolo,et al.  Cache-based Online Adaptation for Machine Translation Enhanced Computer Assisted Translation , 2013, MTSUMMIT.

[4]  Zhe Gan,et al.  Hierarchically Structured Reinforcement Learning for Topically Coherent Visual Story Generation , 2018, AAAI.

[5]  Kate Saenko,et al.  Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering , 2015, ECCV.

[6]  Alexander J. Smola,et al.  Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Diana Gonzalez-Rico,et al.  Contextualize, Show and Tell: A Neural Visual Storyteller , 2018, ArXiv.

[8]  David A. Forsyth,et al.  Matching Words and Pictures , 2003, J. Mach. Learn. Res..

[9]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[10]  Xin Wang,et al.  No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling , 2018, ACL.

[11]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[12]  Hexiang Hu,et al.  Visual Storytelling via Predicting Anchor Word Embeddings in the Stories , 2020, ArXiv.

[13]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[14]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Lun-Wei Ku,et al.  Knowledge-Enriched Visual Storytelling , 2019, AAAI.

[16]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Piji Li,et al.  Storytelling from an Image Stream Using Scene Graphs , 2020, AAAI.

[18]  Byoung-Tak Zhang,et al.  Bilinear Attention Networks , 2018, NeurIPS.

[19]  Lei Zhang,et al.  Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20]  Peng Gao,et al.  Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Trevor Darrell,et al.  Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding , 2016, EMNLP.

[22]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Xinlei Chen,et al.  Mind's eye: A recurrent visual representation for image caption generation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[25]  Byoung-Tak Zhang,et al.  GLAC Net: GLocal Attention Cascading Networks for Multi-image Cued Story Generation , 2018, ArXiv.

[26]  C. Lawrence Zitnick,et al.  CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Jiasen Lu,et al.  Hierarchical Question-Image Co-Attention for Visual Question Answering , 2016, NIPS.

[28]  Gunhee Kim,et al.  Expressing an Image Stream with a Sequence of Natural Sentences , 2015, NIPS.

[29]  Licheng Yu,et al.  Hierarchically-Attentive RNN for Album Summarization and Storytelling , 2017, EMNLP.

[30]  Wei Zhang,et al.  Hierarchical Photo-Scene Encoder for Album Storytelling , 2019, AAAI.

[31]  Tamir Hazan,et al.  Factor Graph Attention , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Yueting Zhuang,et al.  Informative Visual Storytelling with Cross-modal Rules , 2019, ACM Multimedia.

[33]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Natalie Parde,et al.  The Steep Road to Happily Ever after: an Analysis of Current Visual Storytelling Models , 2019, Proceedings of the Second Workshop on Shortcomings in Vision and Language.

[35]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[36]  Francis Ferraro,et al.  Visual Storytelling , 2016, NAACL.

[37]  Lei Li,et al.  Knowledgeable Storyteller: A Commonsense-Driven Generative Model for Visual Storytelling , 2019, IJCAI.

[38]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[39]  Matthieu Cord,et al.  MUTAN: Multimodal Tucker Fusion for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[40]  Zhou Yu,et al.  Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering , 2017, IEEE Transactions on Neural Networks and Learning Systems.

[41]  Tamir Hazan,et al.  High-Order Attention Models for Visual Question Answering , 2017, NIPS.

[42]  Chin-Yew Lin,et al.  ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.