Diverse and Relevant Visual Storytelling with Scene Graph Embeddings

A problem in automatically generated stories for image sequences is that they use overly generic vocabulary and phrase structure and fail to match the distributional characteristics of human-generated text. We address this problem by introducing explicit representations for objects and their relations by extracting scene graphs from the images. Utilizing an embedding of this scene graph enables our model to more explicitly reason over objects and their relations during story generation, compared to the global features from an object classifier used in previous work. We apply metrics that account for the diversity of words and phrases of generated stories as well as for reference to narratively-salient image features and show that our approach outperforms previous systems. Our experiments also indicate that our models obtain competitive results on reference-based metrics.

[1]  Ali Borji,et al.  Human Attention in Image Captioning: Dataset and Analysis , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[2]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[3]  Tao Mei,et al.  Learning Deep Contextual Attention Network for Narrative Photo Stream Captioning , 2017, ACM Multimedia.

[4]  Michael S. Bernstein,et al.  Image retrieval using scene graphs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[6]  Trevor Cohn,et al.  On the Role of Scene Graphs in Image Captioning , 2019, EMNLP.

[7]  Yejin Choi,et al.  The Curious Case of Neural Text Degeneration , 2019, ICLR.

[8]  Lei Li,et al.  Knowledgeable Storyteller: A Commonsense-Driven Generative Model for Visual Storytelling , 2019, IJCAI.

[9]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[10]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[11]  Gang Wang,et al.  Unpaired Image Captioning via Scene Graph Alignments , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[12]  Dong Yu,et al.  Structural Information Preserving for Graph-to-Text Generation , 2020, ACL.

[13]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[15]  Licheng Yu,et al.  Hierarchically-Attentive RNN for Album Summarization and Storytelling , 2017, EMNLP.

[16]  Natalie Parde,et al.  The Steep Road to Happily Ever after: an Analysis of Current Visual Storytelling Models , 2019, Proceedings of the Second Workshop on Shortcomings in Vision and Language.

[17]  Liang Lin,et al.  Knowledge-Embedded Routing Network for Scene Graph Generation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Zhe Gan,et al.  Hierarchically Structured Reinforcement Learning for Topically Coherent Visual Story Generation , 2018, AAAI.

[19]  Zenzi M. Griffin,et al.  PSYCHOLOGICAL SCIENCE Research Article WHAT THE EYES SAY ABOUT SPEAKING , 2022 .

[20]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[21]  Deng Cai,et al.  Graph Transformer for Graph-to-Sequence Learning , 2019, AAAI.

[22]  Xin Wang,et al.  No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling , 2018, ACL.

[23]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[24]  Tao Mei,et al.  Let Your Photos Talk: Generating Narrative Paragraph for Photo Stream via Bidirectional Attention Recurrent Neural Networks , 2017, AAAI.

[25]  Bo Wang,et al.  Image Captioning with Scene-graph Based Semantic Concepts , 2018, ICMLC.

[26]  Jing Wang,et al.  Show, Reward and Tell: Automatic Generation of Narrative Paragraph From Photo Stream by Adversarial Training , 2018, AAAI.

[27]  Yejin Choi,et al.  Neural Motifs: Scene Graph Parsing with Global Context , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[28]  Wei Zhang,et al.  Hierarchical Photo-Scene Encoder for Album Storytelling , 2019, AAAI.

[29]  Chin-Yew Lin,et al.  ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.

[30]  C. Lawrence Zitnick,et al.  CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Mark Hopkins,et al.  Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples , 2018, ACL.

[32]  Dekang Lin,et al.  An Information-Theoretic Definition of Similarity , 1998, ICML.

[33]  Christopher D. Manning,et al.  Effective Approaches to Attention-based Neural Machine Translation , 2015, EMNLP.

[34]  Alan W Black,et al.  Learning to Order Graph Elements with Application to Multilingual Surface Realization , 2019, EMNLP.

[35]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[36]  Li Fei-Fei,et al.  Image Generation from Scene Graphs , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[37]  Francis Ferraro,et al.  Visual Storytelling , 2016, NAACL.

[38]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Piji Li,et al.  Storytelling from an Image Stream Using Scene Graphs , 2020, AAAI.

[40]  Diego Marcheggiani,et al.  Deep Graph Convolutional Encoders for Structured Data to Text Generation , 2018, INLG.

[41]  Xudong Hong,et al.  Improving Language Generation from Feature-Rich Tree-Structured Data with Relational Graph Convolutional Encoders , 2019, EMNLP.

[42]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[43]  Jianfei Cai,et al.  Auto-Encoding Scene Graphs for Image Captioning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).