Multimodal Video Summarization via Time-Aware Transformers
Zehuan Yuan | Xindi Shang | Changhu Wang | Anran Wang