MSMO: Multimodal Summarization with Multimodal Output

Multimodal summarization has drawn much attention due to the rapid growth of multimedia data. The output of current multimodal summarization systems is usually text only. However, our experiments show that multimodal output can significantly improve user satisfaction with the informativeness of summaries. In this paper, we propose a novel task, multimodal summarization with multimodal output (MSMO). To handle this task, we first collect a large-scale dataset for MSMO research. We then propose a multimodal attention model that jointly generates text and selects the most relevant image from the multimodal input. Finally, to evaluate multimodal outputs, we construct a novel multimodal automatic evaluation (MMAE) method that considers both intra-modality salience and inter-modality relevance. The experimental results show the effectiveness of MMAE.
