UniMS: A Unified Framework for Multimodal Summarization with Knowledge Distillation

With the rapid increase of multimedia data, a large body of literature has emerged to work on multimodal summarization, the majority of which target at refining salient information from textual and visual modalities to output a pictorial summary with the most relevant images. Existing methods mostly focus on either extractive or abstractive summarization and rely on qualified image captions to build image references. We are the first to propose a Unified framework for Multimodal Summarization grounding on BART, UniMS, that integrates extractive and abstractive objectives, as well as selecting the image output. Specially, we adopt knowledge distillation from a vision-language pretrained model to improve image selection, which avoids any requirement on the existence and quality of image captions. Besides, we introduce a visual guided decoder to better integrate textual and visual modalities in guiding abstractive text generation. Results show that our best model achieves a new state-of-the-art result on a large-scale benchmark dataset. The newly involved extractive objective as well as the knowledge distillation technique are proven to bring a noticeable improvement to the multimodal summarization task.

[1]  Geoffrey E. Hinton,et al.  Distilling the Knowledge in a Neural Network , 2015, ArXiv.

[2]  Omer Levy,et al.  BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension , 2019, ACL.

[3]  Lei Zhang,et al.  Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[4]  Li Deng,et al.  Deep Learning for Image-to-Text Generation: A Technical Overview , 2017, IEEE Signal Processing Magazine.

[5]  Huaiyu Zhu On Information and Sufficiency , 1997 .

[6]  Chin-Yew Lin,et al.  ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.

[7]  Christopher D. Manning,et al.  Get To The Point: Summarization with Pointer-Generator Networks , 2017, ACL.

[8]  Yu Zhou,et al.  MSMO: Multimodal Summarization with Multimodal Output , 2018, EMNLP.

[9]  Dragomir R. Radev,et al.  LexRank: Graph-based Lexical Centrality as Salience in Text Summarization , 2004, J. Artif. Intell. Res..

[10]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Mirella Lapata,et al.  Text Summarization with Pretrained Encoders , 2019, EMNLP.

[12]  Yu Zhou,et al.  Multimodal Summarization with Guidance of Multimodal Reference , 2020, AAAI.

[13]  Bowen Zhou,et al.  SummaRuNNer: A Recurrent Neural Network Based Sequence Model for Extractive Summarization of Documents , 2016, AAAI.

[14]  Ilya Sutskever,et al.  Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.

[15]  Rohini K. Srihari,et al.  Computational models for integrating linguistic and visual information: A survey , 2004, Artificial Intelligence Review.

[16]  Georg Heigold,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2021, ICLR.

[17]  Yanan Cao,et al.  Flexible Non-Autoregressive Extractive Summarization with Threshold: How to Extract a Non-Fixed Number of Summary Sentences , 2021, AAAI.

[18]  Pengfei Liu,et al.  GSum: A General Framework for Guided Neural Abstractive Summarization , 2021, NAACL.

[19]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[20]  Jun Wang,et al.  LAMS: A Location-aware Approach for Multimodal Summarization (Student Abstract) , 2021, AAAI.

[21]  Bolei Zhou,et al.  Learning Deep Features for Discriminative Localization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Jaemin Cho,et al.  Unifying Vision-and-Language Tasks via Text Generation , 2021, ICML.

[23]  Louis-Philippe Morency,et al.  Multimodal Machine Learning: A Survey and Taxonomy , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  George Kurian,et al.  Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation , 2016, ArXiv.

[25]  Furu Wei,et al.  Faithful to the Original: Fact Aware Neural Abstractive Summarization , 2017, AAAI.

[26]  Ildoo Kim,et al.  ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision , 2021, ICML.