Visual Storytelling with Hierarchical BERT Semantic Guidance

Visual storytelling, which aims to automatically produce a narrative paragraph for a photo album, remains challenging because of the complexity and diversity of album content. In addition, open-domain photo albums cover a broad range of topics, which leads to highly variable vocabularies and expression styles in their descriptions. In this work, a novel teacher-student visual storytelling framework with hierarchical BERT semantic guidance (HBSG) is proposed to address these challenges. The proposed teacher module consists of two joint tasks: word-level latent topic generation and semantic-guided sentence generation. The first task predicts the latent topic of the story. Because no ground-truth topic information is available, a pre-trained BERT model based on visual content and annotated stories is used to mine topics. The topic vector is then distilled into a designed image-topic prediction model. In the semantic-guided sentence generation task, HBSG serves two purposes. The first is to reduce the language complexity across topics: a co-attention decoder over vision and semantics is designed to leverage the latent topics to induce topic-related language models. The second is to employ sentence semantics as an online external linguistic knowledge teacher module. Finally, an auxiliary loss is devised to transfer linguistic knowledge into the language generation model. Extensive experiments demonstrate the effectiveness of the HBSG framework, which surpasses state-of-the-art approaches on the VIST test set.
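The distillation and auxiliary-loss steps can be illustrated with a minimal sketch. The abstract does not state the form of either loss, so everything here is an assumption: a KL divergence between a teacher topic distribution and the student's predicted topic distribution stands in for the distillation objective, and the names `topic_distillation_loss`, `total_loss`, and `aux_weight` are hypothetical. The teacher logits are plain placeholder numbers, not real BERT outputs.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def topic_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Assumed distillation objective: pull the student's topic
    distribution (predicted from images) toward the teacher's
    (mined from text). KL divergence is an illustrative choice."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return kl_divergence(teacher, student)


def total_loss(generation_loss, teacher_logits, student_logits, aux_weight=0.5):
    """Assumed combination: sentence generation loss plus a weighted
    auxiliary distillation term. aux_weight is a hypothetical setting."""
    aux = topic_distillation_loss(teacher_logits, student_logits)
    return generation_loss + aux_weight * aux


if __name__ == "__main__":
    teacher_topic_logits = [2.0, 0.5, -1.0]  # placeholder teacher scores
    student_topic_logits = [1.0, 1.0, 0.0]   # placeholder student scores
    print(total_loss(3.2, teacher_topic_logits, student_topic_logits))
```

In this sketch the auxiliary term vanishes when the student's topic distribution matches the teacher's, so only the generation loss remains; the weighting between the two terms would need to be tuned in practice.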
