Hierarchical Cross-Modality Semantic Correlation Learning Model for Multimodal Summarization

Multimodal summarization with multimodal output (MSMO) generates a summary that contains both textual and visual content. Multimodal news reports contain heterogeneous content, which makes MSMO nontrivial. Moreover, the different modalities of data in a news report are observed to correlate hierarchically. Traditional MSMO methods handle the different modalities indiscriminately by learning a single representation for the whole data, which is not directly adaptable to heterogeneous content and hierarchical correlation. In this paper, we propose a hierarchical cross-modality semantic correlation learning model (HCSCL) to learn the intra- and inter-modal correlations that exist in multimodal data. HCSCL adopts a graph network to encode the intra-modal correlation. A hierarchical fusion framework is then proposed to learn the hierarchical correlation between text and images. Furthermore, we construct a new dataset with relevant image annotations and image object labels to provide supervision for the learning procedure. Extensive experiments on the dataset show that HCSCL significantly outperforms the baseline methods on automatic summarization metrics and fine-grained diversity tests.
