A Unified Framework for Slot based Response Generation in a Multimodal Dialogue System

Natural Language Understanding (NLU) and Natural Language Generation (NLG) are the two critical components of every conversational system: NLU understands the user by capturing the necessary information in the form of slots, and NLG generates an appropriate response in accordance with the extracted information. Recently, dialogue systems integrated with complementary information such as images, audio, or video have gained immense popularity. In this work, we propose an end-to-end framework that extracts the necessary slot values from the utterance and generates a coherent response, thereby assisting users in achieving their desired goals in a multimodal dialogue system that has both textual and visual information. Extracting the necessary information depends not only on the text but also on the visual cues present in the dialogue. Similarly, for generation, the previous dialogue context comprising multimodal information is significant for providing coherent and informative responses. We employ a multimodal hierarchical encoder using pre-trained DialoGPT and also exploit the knowledge base (KB) to provide a stronger context for both tasks. We then design a slot attention mechanism to focus on the necessary information in a given utterance. Finally, a decoder generates the corresponding response for the given dialogue context and the extracted slot values. Experimental results on the Multimodal Dialogue Dataset (MMD) show that the proposed framework outperforms the baseline approaches on both tasks. The code is available at https://github.com/avinashsai/slot-gpt.
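
The abstract does not specify the form of the slot attention mechanism. As a rough illustration of the general idea only, the sketch below applies generic scaled dot-product attention from a single slot query over per-token states, so that tokens relevant to the slot receive higher weight. It is not the authors' implementation: the function names, the `colour_query` vector, and the two-dimensional token states are hypothetical, hand-made values chosen for readability.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def slot_attention(slot_query, token_states):
    """Scaled dot-product attention of one slot query over token states.

    Returns the attention weights (one per token) and the weighted
    sum of the token states (the slot-specific context vector).
    """
    d = len(slot_query)
    scores = [
        sum(q * h for q, h in zip(slot_query, state)) / math.sqrt(d)
        for state in token_states
    ]
    weights = softmax(scores)
    context = [
        sum(w * state[i] for w, state in zip(weights, token_states))
        for i in range(d)
    ]
    return weights, context


# Toy example: hypothetical 2-d states for each token of an utterance.
tokens = ["show", "me", "red", "dresses"]
token_states = [[0.1, 0.0], [0.0, 0.1], [2.0, 0.1], [0.2, 1.5]]
colour_query = [1.0, 0.0]  # assumed query vector for a "colour" slot

weights, context = slot_attention(colour_query, token_states)
best = tokens[max(range(len(tokens)), key=weights.__getitem__)]
print(best, [round(w, 3) for w in weights])
```

In a real system the token states would come from the encoder and the slot query would be learned; here both are fixed by hand so the behaviour is easy to follow.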
