[Figure: Example album showing descriptions for images in isolation and in sequences, captions in sequence, and multiple told and re-told stories, with a preferred photo sequence.]

We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling. The dataset includes 81,743 unique photos in 20,211 sequences, aligned to descriptive and story language. We establish several strong baselines for the storytelling task, and motivate an automatic metric to benchmark progress. We argue that modelling figurative and social language, as supported by this dataset and the storytelling task, has the potential to move artificial intelligence towards more human-like expression and understanding.
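To make the alignment concrete, here is a minimal sketch of how one record in such a dataset might be represented: each photo in an ordered sequence carries a description written in isolation, a caption written with the sequence as context, and one sentence of a crowd-written story. The class and field names below are hypothetical illustrations, not the released file format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedPhoto:
    photo_id: str       # unique photo identifier
    description: str    # description of the image in isolation
    caption: str        # caption written with the sequence as context
    story_sentence: str # one sentence of the story for this photo


@dataclass
class StorySequence:
    album_id: str
    photos: List[AlignedPhoto]  # photos in their narrative order

    def story(self) -> str:
        """Concatenate the per-photo sentences into the full story."""
        return " ".join(p.story_sentence for p in self.photos)
```

A storytelling model would then map the ordered photos of a `StorySequence` to its `story()` text, while the isolated descriptions and in-sequence captions serve as contrasting forms of grounding language.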
