Multimodal Conversational AI: A Survey of Datasets and Approaches

As humans, we experience the world with all our senses or modalities (sound, sight, touch, smell, and taste). We use these modalities, particularly sight and touch, to convey and interpret specific meanings. Multimodal expressions are central to conversations; the modalities in a rich set amplify and often compensate for one another. A multimodal conversational AI system answers questions, fulfills tasks, and emulates human conversations by understanding and expressing itself via multiple modalities. This paper motivates, defines, and mathematically formulates the multimodal conversational research objective. We provide a taxonomy of the research required to solve the objective: multimodal representation, fusion, alignment, translation, and co-learning. We survey state-of-the-art datasets and approaches for each research area and highlight their limiting assumptions. Finally, we identify multimodal co-learning as a promising direction for multimodal conversational AI research.
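
The formulation itself appears in the body of the survey rather than in the abstract. As orientation only, one plausible form of the objective is sketched below, under the assumption that the agent selects its next (possibly multimodal) response given the multimodal dialogue history and any situated context; the symbols \(U_\tau\), \(R_\tau\), \(C_t\), and \(\theta\) are illustrative and are not the paper's own notation.

\[
\hat{R}_t \;=\; \arg\max_{R_t}\; P_\theta\!\left(R_t \,\middle|\, U_{1:t},\, R_{1:t-1},\, C_t\right),
\qquad
U_\tau \;=\; \left\{\, u_\tau^{\text{text}},\; u_\tau^{\text{speech}},\; u_\tau^{\text{vision}},\; \dots \,\right\}
\]

Here each user turn \(U_\tau\) bundles its modality-specific signals, \(R_{1:t-1}\) denotes the agent's earlier (also possibly multimodal) responses, and \(C_t\) is any situated context such as a shared screen or scene.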
