Multimodal Video Summarization via Time-Aware Transformers

With the growing number of videos on video-sharing platforms, facilitating the search and browsing of user-generated videos has attracted intense attention from the multimedia community. To help people efficiently search and browse relevant videos, video summaries have become important. Prior work on multimodal video summarization mainly treats visual and ASR tokens as two separate sources and struggles to fuse the multimodal information when generating summaries; moreover, the time information inside videos is commonly ignored. In this paper, we find that leveraging timestamps is important for accurately incorporating multimodal signals into the task. We propose a Time-Aware Multimodal Transformer (TAMT) with a novel short-term order-sensitive attention mechanism, which attends to the inputs differently according to their time differences and thereby exploits the temporal structure inherent in videos more thoroughly. As such, TAMT can better fuse the different modalities for summarizing videos. Experiments show that our proposed approach is effective and achieves state-of-the-art performance on both the YouCookII and open-domain How2 datasets.
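
The abstract does not specify the exact formulation of the short-term order-sensitive attention. As a rough intuition only, the following is a minimal sketch of one way such a mechanism could be realized: a signed time-difference penalty added to the attention logits before the softmax. The function name, the `tau` bandwidth, and the asymmetric decay are illustrative assumptions, not the paper's actual method.

```python
import torch
import torch.nn.functional as F

def time_aware_attention(q, k, v, timestamps, tau=5.0):
    """Scaled dot-product attention with a short-term, order-sensitive bias.

    q, k, v:    (seq_len, d) query/key/value matrices.
    timestamps: (seq_len,) start time, in seconds, of each token's segment.
    tau:        hypothetical temporal bandwidth; token pairs much further
                apart than ~tau seconds are strongly down-weighted.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5             # (seq_len, seq_len)

    # Signed time difference keeps the bias order-sensitive: attending
    # backward in time (dt >= 0) is penalized differently from attending
    # forward (dt < 0), and the penalty grows with temporal distance.
    dt = timestamps.unsqueeze(1) - timestamps.unsqueeze(0)  # dt[i, j] = t_i - t_j
    bias = -torch.where(dt >= 0, dt / tau, -2.0 * dt / tau) # asymmetric decay

    attn = F.softmax(scores + bias, dim=-1)
    return attn @ v
```

In this sketch, the decay enforces short-term locality (distant tokens contribute little), while the asymmetry between past and future encodes order; the actual TAMT mechanism may parameterize the time-difference bias differently, e.g., with learned rather than fixed penalties.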
