VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles

A popular multimedia news format today pairs a lively video with a corresponding news article. It is used by influential news media such as CNN and BBC and by social media platforms such as Twitter and Weibo. In this setting, automatically choosing a proper cover frame for the video and generating an appropriate textual summary of the article can help editors save time and help readers make decisions more effectively. Hence, in this paper, we propose the task of Video-based Multimodal Summarization with Multimodal Output (VMSMO) to tackle this problem. The main challenge in this task is to jointly model the temporal dependency of the video and the semantic meaning of the article. To this end, we propose a Dual-Interaction-based Multimodal Summarizer (DIMS), consisting of a dual interaction module and a multimodal generator. In the dual interaction module, we propose a conditional self-attention mechanism that captures local semantic information within the video and a global-attention mechanism that handles the semantic relationship between the news text and the video at a high level. Extensive experiments conducted on a large-scale real-world VMSMO dataset show that DIMS achieves state-of-the-art performance in terms of both automatic metrics and human evaluations.
