论文信息 - Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog

Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog

Audio-Visual Scene-Aware Dialog (AVSD) is a task to generate responses when chatting about a given video, which is organized as a track of the 8th Dialog System Technology Challenge (DSTC8). To solve the task, we propose a universal multimodal transformer and introduce the multi-task learning method to learn joint representations among different modalities as well as generate informative and fluent responses. Our method extends the natural language generation pre-trained model to multimodal dialogue generation task. Our system achieves the best performance in both objective and subjective evaluations in the challenge.

[1] Margaret Mitchell,et al. VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[2] Alon Lavie,et al. Meteor Universal: Language Specific Translation Evaluation for Any Target Language , 2014, WMT@ACL.

[3] Jason Weston,et al. Learning to Speak and Act in a Fantasy Text Adventure Game , 2019, EMNLP.

[4] Chin-Yew Lin,et al. ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.

[5] Cordelia Schmid,et al. VideoBERT: A Joint Model for Video and Language Representation Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[6] George Kurian,et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation , 2016, ArXiv.

[7] Jason Weston,et al. Wizard of Wikipedia: Knowledge-Powered Conversational agents , 2018, ICLR.

[8] Tim K. Marks,et al. Audio Visual Scene-aware dialog (AVSD) Track for Natural Language Generation in DSTC7 , 2019 .

[9] Yash Goyal,et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10] Salim Roukos,et al. Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[11] Danqi Chen,et al. CoQA: A Conversational Question Answering Challenge , 2018, TACL.

[12] Pascale Fung,et al. Mem2Seq: Effectively Incorporating Knowledge Bases into End-to-End Task-Oriented Dialog Systems , 2018, ACL.

[13] Ahmed El Kholy,et al. UNITER: Learning UNiversal Image-TExt Representations , 2019, ECCV 2020.

[14] Anoop Cherian,et al. End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15] Fabio Viola,et al. The Kinetics Human Action Video Dataset , 2017, ArXiv.

[16] G. G. Stokes. "J." , 1890, The New Yale Book of Quotations.

[17] José M. F. Moura,et al. Visual Dialog , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18] C. Lawrence Zitnick,et al. CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19] Yejin Choi,et al. The Curious Case of Neural Text Degeneration , 2019, ICLR.

[20] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[21] Yang Feng,et al. Incremental Transformer with Deliberation Decoder for Document Grounded Conversations , 2019, ACL.

[22] Florian Metze,et al. CMU Sinbad’s Submission for the DSTC7 AVSD Challenge , 2019 .

[23] Yu Cheng,et al. UNITER: UNiversal Image-TExt Representation Learning , 2019, ECCV.

[24] Alan W. Black,et al. A Dataset for Document Grounded Conversations , 2018, EMNLP.

[25] Thomas Wolf,et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing , 2019, ArXiv.

[26] DSTC 7-AVSD : Scene-Aware Video-Dialogue Systems with Dual Attention , 2018 .

[27] Doyen Sahoo,et al. Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems , 2019, ACL.

[28] J. Koenderink. Q… , 2014, Les noms officiels des communes de Wallonie, de Bruxelles-Capitale et de la communaute germanophone.

[29] Tien Dat Nguyen,et al. From FiLM to Video: Multi-turn Question Answering with Multi-modal Context , 2018, ArXiv.

[30] Ilya Sutskever,et al. Language Models are Unsupervised Multitask Learners , 2019 .

[31] Ji Wang,et al. Pretraining-Based Natural Language Generation for Text Summarization , 2019, CoNLL.

[32] Thomas Wolf,et al. TransferTransfo: A Transfer Learning Approach for Neural Network Based Conversational Agents , 2019, ArXiv.