Paper Information - Hierarchically-Attentive RNN for Album Summarization and Storytelling

Hierarchically-Attentive RNN for Album Summarization and Storytelling

We address the problem of end-to-end visual storytelling. Given a photo album, our model first selects the most representative (summary) photos, and then composes a natural language story for the album. For this task, we make use of the Visual Storytelling dataset and a model composed of three hierarchically-attentive Recurrent Neural Nets (RNNs) to: encode the album photos, select representative (summary) photos, and compose the story. Automatic and human evaluations show our model achieves better performance on selection, generation, and retrieval than baselines.

Licheng Yu | Tamara L. Berg | Mohit Bansal
