Paper information - Towards Coherent Visual Storytelling with Ordered Image Attention

We address the problem of visual storytelling, i.e., generating a story for a given sequence of images. While each sentence of the story should describe a corresponding image, a coherent story also needs to be consistent and relate to both future and past images. To achieve this, we develop ordered image attention (OIA). OIA models interactions between the sentence-corresponding image and important regions in other images of the sequence. To highlight the important objects, a message-passing-like algorithm collects representations of those objects in an order-aware manner. To generate the story’s sentences, we then highlight important image attention vectors with an Image-Sentence Attention (ISA). Further, to alleviate common linguistic mistakes like repetitiveness, we introduce an adaptive prior. The obtained results improve the METEOR score on the VIST dataset by 1%. In addition, an extensive human study verifies coherency improvements and shows that stories generated with OIA and ISA are more focused, shareable, and image-grounded.
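
The idea of attending from the sentence-corresponding image to regions of the other images, while taking image order into account, can be illustrated with a minimal sketch. This is not the paper's OIA formulation, which the abstract does not specify. It is a plain dot-product attention over region vectors with an additive bias per relative image position. The function name `ordered_context`, the `order_bias` table, and the toy feature vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ordered_context(query, regions_per_image, current_index, order_bias):
    """Attend from the current image to regions of all other images.

    query: feature vector of the sentence-corresponding image (hypothetical).
    regions_per_image: one list of region vectors per image in the sequence.
    order_bias: maps the relative offset (j - current_index) to a scalar that
        is added to the scores of image j. This is an assumed stand-in for
        order awareness, not the mechanism used in the paper.
    Returns (context_vector, attention_weights).
    """
    scores, vectors = [], []
    for j, regions in enumerate(regions_per_image):
        if j == current_index:
            continue  # only regions from the other images are attended to
        bias = order_bias.get(j - current_index, 0.0)
        for region in regions:
            scores.append(dot(query, region) + bias)
            vectors.append(region)
    weights = softmax(scores)
    context = [
        sum(w * v[k] for w, v in zip(weights, vectors))
        for k in range(len(query))
    ]
    return context, weights


# Toy example: three images, one region each, current image in the middle.
query = [1.0, 0.0]
regions = [[[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 0.0]]]
context, weights = ordered_context(query, regions, 1, {-1: 1.0, 1: 0.0})
```

In this toy setup the previous and next images have identical region features, so any difference between their attention weights comes only from the assumed order bias.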

Ariel Shamir | Alexander Schwing | Idan Schwartz | Tom Braude
