Multimodal Conversational AI: A Survey of Datasets and Approaches
[1] Mostafa Dehghani, et al. VUT: Versatile UI Transformer for Multi-Modal Multi-Task User Interface Modeling, 2021, ArXiv.
[2] Marcus Rohrbach, et al. FLAVA: A Foundational Language And Vision Alignment Model, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[3] James M. Rehg, et al. Ego4D: Around the World in 3,000 Hours of Egocentric Video, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[4] Björn Hoffmeister, et al. Multi-Modal Pre-Training for Automated Speech Recognition, 2021, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[5] Dmytro Okhonko, et al. VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding, 2021, EMNLP.
[6] Li Fei-Fei, et al. ObjectFolder: A Dataset of Objects with Implicit Visual, Auditory, and Tactile Representations, 2021, CoRL.
[7] Yang Feng, et al. Towards Expressive Communication with Internet Memes: A New Multimodal Conversation Dataset and Benchmark, 2021, ArXiv.
[8] Baolin Peng, et al. Soloist: Building Task Bots at Scale with Transfer Learning and Machine Teaching, 2021, Transactions of the Association for Computational Linguistics.
[9] Chongyang Bai, et al. UIBert: Learning Generic Multimodal Representations for UI Understanding, 2021, IJCAI.
[10] Ruslan Salakhutdinov, et al. HuBERT: How Much Can a Bad Teacher Benefit ASR Pre-Training?, 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[11] Balaji Vasan Srinivasan, et al. MIMOQA: Multimodal Input Multimodal Output Question Answering, 2021, NAACL.
[12] Jonathan Le Roux, et al. Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers, 2021, AAAI.
[13] Shih-Fu Chang, et al. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, 2021, NeurIPS.
[14] Alborz Geramifard, et al. SIMMC 2.0: A Task-oriented Dialog Dataset for Immersive Multimodal Conversations, 2021, EMNLP.
[15] Larry Heck, et al. Grounding Open-Domain Instructions to Automate Web Support Tasks, 2021, NAACL.
[16] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.
[17] Alborz Geramifard, et al. DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue, 2021, ACL.
[18] Ruby B. Lee, et al. ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces, 2020, AAAI.
[19] Larry Heck, et al. mForms: Multimodal Form Filling with Question Answering, 2020, International Conference on Language Resources and Evaluation.
[20] Andrew Zisserman, et al. A Short Note on the Kinetics-700-2020 Human Action Dataset, 2020, ArXiv.
[21] Christopher D. Manning, et al. Contrastive Learning of Medical Visual Representations from Paired Images and Text, 2020, MLHC.
[22] Christopher D. Manning, et al. Neural Generation Meets Real People: Towards Emotionally Engaging Mixed-Initiative Conversations, 2020, ArXiv.
[23] Serge J. Belongie, et al. Spatiotemporal Contrastive Video Representation Learning, 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[24] M. Zaheer, et al. Big Bird: Transformers for Longer Sequences, 2020, NeurIPS.
[25] Chen Sun, et al. Multi-modal Transformer for Video Retrieval, 2020, ECCV.
[26] Andrew Zisserman, et al. Self-Supervised MultiModal Versatile Networks, 2020, NeurIPS.
[27] Eric Michael Smith, et al. Open-Domain Conversational Agents: Current Progress, Open Problems, and Future Directions, 2020, ArXiv.
[28] Abdel-rahman Mohamed, et al. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, 2020, NeurIPS.
[29] Pierre H. Richemond, et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning, 2020, NeurIPS.
[30] Paul A. Crook, et al. Situated and Interactive Multimodal Conversations, 2020, COLING.
[31] R. Socher, et al. A Simple Language Model for Task-Oriented Dialogue, 2020, NeurIPS.
[32] N. Vasconcelos, et al. Audio-Visual Instance Discrimination with Cross-Modal Agreement, 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[33] Richard Socher, et al. TOD-BERT: Pre-trained Natural Language Understanding for Task-Oriented Dialogue, 2020, EMNLP.
[34] Li Dong, et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, 2020, ECCV.
[35] Geoffrey Zweig, et al. Multi-modal Self-Supervision from Generalized Data Transformations, 2020, ArXiv.
[36] Shalini Ghosh, et al. Cross-modal Learning for Multi-modal Video Categorization, 2020, ArXiv.
[37] Michael S. Ryoo, et al. Evolving Losses for Unsupervised Video Representation Learning, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[38] Geoffrey E. Hinton, et al. A Simple Framework for Contrastive Learning of Visual Representations, 2020, ICML.
[39] Quoc V. Le, et al. Towards a Human-like Open-Domain Chatbot, 2020, ArXiv.
[40] Mohit Bansal, et al. ManyModalQA: Modality Disambiguation and QA over Diverse Inputs, 2020, AAAI.
[41] Andrew Zisserman, et al. End-to-End Learning of Visual Representations From Uncurated Instructional Videos, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[42] Luke Zettlemoyer, et al. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[43] D. Mahajan, et al. Self-Supervised Learning by Cross-Modal Audio-Video Clustering, 2019, NeurIPS.
[44] Aren Jansen, et al. Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision, 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[45] Ross B. Girshick, et al. Momentum Contrast for Unsupervised Visual Representation Learning, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[46] Tsung-Hsien Wen, et al. ConveRT: Efficient and Accurate Conversational Representations from Transformers, 2019, Findings of EMNLP.
[47] Jianfeng Gao, et al. DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation, 2019, ACL.
[48] Peter J. Liu, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.
[49] Hua Wu, et al. PLATO: Pre-trained Dialogue Generation Model with Discrete Latent Variable, 2019, ACL.
[50] D. Ramanan, et al. CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning, 2019, ICLR.
[51] Yu Cheng, et al. UNITER: UNiversal Image-TExt Representation Learning, 2019, ECCV.
[52] Cordelia Schmid, et al. Learning Video Representations using Contrastive Bidirectional Transformer, 2019.
[53] Toby Jia-Jun Li, et al. PUMICE: A Multi-Modal Agent that Learns Concepts and Conditionals from Natural Language and Demonstrations, 2019, UIST.
[54] Mohit Bansal, et al. LXMERT: Learning Cross-Modality Encoder Representations from Transformers, 2019, EMNLP.
[55] Stefan Lee, et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, 2019, NeurIPS.
[56] Larry P. Heck, et al. Generative Visual Dialogue System via Weighted Likelihood Estimation, 2019, IJCAI.
[57] Andrew Zisserman, et al. A Short Note on the Kinetics-700 Human Action Dataset, 2019, ArXiv.
[58] Jesse Thomason, et al. Vision-and-Dialog Navigation, 2019, CoRL.
[59] Ivan Laptev, et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[60] Licheng Yu, et al. TVQA+: Spatio-Temporal Grounding for Video Question Answering, 2019, ACL.
[61] Tamir Hazan, et al. Factor Graph Attention, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[62] Cordelia Schmid, et al. VideoBERT: A Joint Model for Video and Language Representation Learning, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[63] C.-C. Jay Kuo, et al. Generative Visual Dialogue System via Adaptive Reasoning and Weighted Likelihood Estimation, 2019, ArXiv.
[64] Hongxia Jin, et al. Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[65] Anoop Cherian, et al. Audio Visual Scene-Aware Dialog, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[66] Thomas Wolf, et al. TransferTransfo: A Transfer Learning Approach for Neural Network Based Conversational Agents, 2019, ArXiv.
[67] Harry Shum, et al. The Design and Implementation of XiaoIce, an Empathetic Social Chatbot, 2018, Computational Linguistics.
[68] Peng Gao, et al. Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[69] Chuang Gan, et al. TSM: Temporal Shift Module for Efficient Video Understanding, 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[70] Antoine Bordes, et al. Image-Chat: Engaging Grounded Conversations, 2018, ACL.
[71] Y-Lan Boureau, et al. Towards Empathetic Open-domain Conversation Models: A New Benchmark and Dataset, 2018, ACL.
[72] Chuang Gan, et al. Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding, 2018, NeurIPS.
[73] Antoine Bordes, et al. Training Millions of Personalized Dialogue Agents, 2018, EMNLP.
[74] José M. F. Moura, et al. Visual Coreference Resolution in Visual Dialog using Neural Module Networks, 2018, ECCV.
[75] Licheng Yu, et al. TVQA: Localized, Compositional Video Question Answering, 2018, EMNLP.
[76] Jason Weston, et al. Talk the Walk: Navigating New York City through Grounded Dialogue, 2018, ArXiv.
[77] Jianfeng Gao, et al. Neural Approaches to Conversational AI, 2018, ACL.
[78] Ross B. Girshick, et al. Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization, 2018, NeurIPS.
[79] Dilek Z. Hakkani-Tür, et al. Dialogue Learning with Human Teaching and Feedback in End-to-End Trainable Task-Oriented Dialogue Systems, 2018, NAACL.
[80] Gökhan Tür, et al. (Almost) Zero-Shot Cross-Lingual Spoken Language Understanding, 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[81] Bing Liu, et al. End-to-End Optimization of Task-Oriented Dialogue Model with Deep Reinforcement Learning, 2017, ArXiv.
[82] Gökhan Tür, et al. Towards Zero-Shot Frame Semantic Parsing for Domain Scaling, 2017, INTERSPEECH.
[83] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[84] Louis-Philippe Morency, et al. Multimodal Machine Learning: A Survey and Taxonomy, 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[85] Christopher D. Manning, et al. Key-Value Retrieval Networks for Task-Oriented Dialogue, 2017, SIGDIAL Conference.
[86] Trevor Darrell, et al. Learning to Reason: End-to-End Module Networks for Visual Question Answering, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[87] Aren Jansen, et al. Audio Set: An ontology and human-labeled dataset for audio events, 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[88] Chenliang Xu, et al. Towards Automatic Learning of Procedures From Web Instructional Videos, 2017, AAAI.
[89] Jianfeng Gao, et al. Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation, 2017, IJCNLP.
[90] John R. Hershey, et al. Attention-Based Multimodal Fusion for Video Description, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[91] José M. F. Moura, et al. Visual Dialog, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[92] Hugo Larochelle, et al. GuessWhat?! Visual Object Discovery through Multi-modal Dialogue, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[93] Zhuowen Tu, et al. Aggregated Residual Transformations for Deep Neural Networks, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[94] Ali Farhadi, et al. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding, 2016, ECCV.
[95] Jianfeng Gao, et al. A Persona-Based Neural Conversation Model, 2016, ACL.
[96] Tao Mei, et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[97] Sanja Fidler, et al. MovieQA: Understanding Stories in Movies through Question-Answering, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[98] Kate Saenko, et al. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering, 2015, ECCV.
[99] Michael S. Bernstein, et al. Visual7W: Grounded Question Answering in Images, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[100] Damian Borth, et al. Real-time Analysis and Visualization of the YFCC100m Dataset, 2015, MMCommons '15.
[101] Jianfeng Gao, et al. A Diversity-Promoting Objective Function for Neural Conversation Models, 2015, NAACL.
[102] Joelle Pineau, et al. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models, 2015, AAAI.
[103] Stephen Clark, et al. Grounding Semantics in Olfactory Perception, 2015, ACL.
[104] Jianfeng Gao, et al. A Neural Network Approach to Context-Sensitive Generation of Conversational Responses, 2015, NAACL.
[105] Quoc V. Le, et al. A Neural Conversational Model, 2015, ArXiv.
[106] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[107] Wei Xu, et al. Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question, 2015, NIPS.
[108] Mario Fritz, et al. Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[109] Margaret Mitchell, et al. VQA: Visual Question Answering, 2015, International Journal of Computer Vision.
[110] Hang Li, et al. Neural Responding Machine for Short-Text Conversation, 2015, ACL.
[111] Geoffrey Zweig, et al. Using Recurrent Neural Networks for Slot Filling in Spoken Language Understanding, 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[113] Larry P. Heck, et al. Deep learning of knowledge graph embeddings for semantic parsing of Twitter dialogs, 2014, 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP).
[114] Dilek Z. Hakkani-Tür, et al. Eye Gaze for Spoken Language Understanding in Multi-modal Conversational Interactions, 2014, ICMI.
[115] Ruslan Salakhutdinov, et al. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014, ArXiv.
[116] Fei-Fei Li, et al. Linking People in Videos with "Their" Names Using Coreference Resolution, 2014, ECCV.
[117] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.
[118] Kun Duan, et al. Multimodal Learning in Loosely-Organized Web Images, 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.
[119] Sanja Fidler, et al. What Are You Talking About? Text-to-Image Coreference, 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.
[120] Gökhan Tür, et al. Extending domain coverage of language understanding systems via intent transfer between domains using knowledge graphs and search query click logs, 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[121] Gökhan Tür, et al. Multi-Modal Conversational Search and Browse, 2013, SLAM@INTERSPEECH.
[122] Gökhan Tür, et al. Using a knowledge graph and query click logs for unsupervised learning of relation detection, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.
[123] J. Dean, et al. Efficient Estimation of Word Representations in Vector Space, 2013, ICLR.
[124] Mubarak Shah, et al. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild, 2012, ArXiv.
[125] R. Salakhutdinov, et al. Multimodal learning with deep Boltzmann machines, 2012, J. Mach. Learn. Res.
[126] Larry Heck. The Conversational Web, 2012.
[127] Dilek Z. Hakkani-Tür, et al. Exploiting the Semantic Web for unsupervised spoken language understanding, 2012, 2012 IEEE Spoken Language Technology Workshop (SLT).
[128] Gökhan Tür, et al. Translating natural language utterances to search queries for SLU domain detection using query click logs, 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[129] Thomas Serre, et al. HMDB: A large video database for human motion recognition, 2011, 2011 International Conference on Computer Vision.
[131] Alan Ritter, et al. Data-Driven Response Generation in Social Media, 2011, EMNLP.
[132] Hatice Gunes, et al. Continuous Prediction of Spontaneous Affect from Multiple Cues and Modalities in Valence-Arousal Space, 2011, IEEE Transactions on Affective Computing.
[133] A. Ng, et al. Multimodal Deep Learning, 2011, ICML.
[134] Fei-Fei Li, et al. Hierarchical semantic indexing for large scale image retrieval, 2011, CVPR 2011.
[135] Gökhan Tür, et al. Research Challenges and Opportunities in Mobile Applications [DSP Education], 2011, IEEE Signal Processing Magazine.
[136] Gökhan Tür, et al. Exploiting query click logs for utterance domain detection in spoken language understanding, 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[137] Jason Weston, et al. Large scale image annotation: learning to rank with joint word-image embeddings, 2010, Machine Learning.
[138] Fei-Fei Li, et al. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora, 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
[139] Aapo Hyvärinen, et al. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models, 2010, AISTATS.
[140] Jean-Philippe Thiran, et al. Dynamic modality weighting for multi-stream HMMs in audio-visual speech recognition, 2008, ICMI '08.
[141] Lianhong Cai, et al. Multi-level Fusion of Audio and Visual Features for Speaker Identification, 2006, International Conference on Biometrics.
[142] Chalapathy Neti, et al. Recent advances in the automatic recognition of audiovisual speech, 2003, Proc. IEEE.
[143] Marilyn A. Walker, et al. MATCH: An Architecture for Multimodal Dialogue Systems, 2002, ACL.
[144] Kevin P. Murphy, et al. A coupled HMM for audio-visual speech recognition, 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.
[145] Li Deng, et al. MiPad: a multimodal interaction prototype, 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).
[146] Jean-Luc Gauvain, et al. User evaluation of the MASK kiosk, 1998, Speech Commun.
[147] Avrim Blum, et al. Combining Labeled and Unlabeled Data with Co-Training, 1998, COLT.
[148] Paul Duchnowski, et al. Adaptive bimodal sensor fusion for automatic speechreading, 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.
[149] A. Waibel, et al. See me, hear me: integrating automatic speech recognition and lip-reading, 1994, ICSLP.
[150] Yochai Konig, et al. "Eigenlips" for robust speech recognition, 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.
[151] Geoffrey E. Hinton, et al. Self-organizing neural network that discovers surfaces in random-dot stereograms, 1992, Nature.
[152] B.P. Yuhas, et al. Integration of acoustic and visual speech signals using neural networks, 1989, IEEE Communications Magazine.
[153] K. Bach, et al. Linguistic Communication and Speech Acts, 1983.
[154] Josh H. McDermott, et al. Visual Learning, 1968.
[155] Joseph Weizenbaum. ELIZA—a computer program for the study of natural language communication between man and machine, 1966, CACM.
[156] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[157] Dilek Z. Hakkani-Tür, et al. Interactive reinforcement learning for task-oriented dialogue management, 2016.
[158] Guigang Zhang, et al. Deep Learning, 2016, Int. J. Semantic Comput.