Multimodal Dialog System: Generating Responses via Adaptive Decoders

On the shoulders of textual dialog systems, the multimodal ones, recently have engaged increasing attention, especially in the retail domain. Despite the commercial value of multimodal dialog systems, they still suffer from the following challenges: 1) automatically generate the right responses in appropriate medium forms; 2) jointly consider the visual cues and the side information while selecting product images; and 3) guide the response generation with multi-faceted and heterogeneous knowledge. To address the aforementioned issues, we present a Multimodal diAloG system with adaptIve deCoders, MAGIC for short. In particular, MAGIC first judges the response type and the corresponding medium form via understanding the intention of the given multimodal context. Hereafter, it employs adaptive decoders to generate the desired responses: a simple recurrent neural network (RNN) is applied to generating general responses, then a knowledge-aware RNN decoder is designed to encode the multiform domain knowledge to enrich the response, and the multimodal response decoder incorporates an image recommendation model which jointly considers the textual attributes and the visual images via a neural model optimized by the max-margin loss. We comparatively justify MAGIC over a benchmark dataset. Experiment results demonstrate that MAGIC outperforms the existing methods and achieves the state-of-the-art performance.

[1]  George R. Doddington,et al.  Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics , 2002 .

[2]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[3]  Gary Marchionini,et al.  Synthesis Lectures on Information Concepts, Retrieval, and Services , 2009 .

[4]  Antoine Raux,et al.  The Dialog State Tracking Challenge , 2013, SIGDIAL Conference.

[5]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[6]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[7]  Jason Weston,et al.  End-To-End Memory Networks , 2015, NIPS.

[8]  Jiasen Lu,et al.  VQA: Visual Question Answering , 2015, ICCV.

[9]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[10]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[11]  Geoffrey Zweig,et al.  Attention with Intention for a Neural Network Conversation Model , 2015, ArXiv.

[12]  Sangdo Han,et al.  Exploiting knowledge base to generate responses for natural language dialog listening agents , 2015, SIGDIAL Conference.

[13]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[14]  Joelle Pineau,et al.  Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models , 2015, AAAI.

[15]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Geoffrey Zweig,et al.  End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning , 2016, ArXiv.

[17]  Jianfeng Gao,et al.  A Diversity-Promoting Objective Function for Neural Conversation Models , 2015, NAACL.

[18]  Rui Yan,et al.  Learning to Respond with Deep Neural Networks for Retrieval-Based Human-Computer Conversation System , 2016, SIGIR.

[19]  Filip Radlinski,et al.  Towards Conversational Recommender Systems , 2016, KDD.

[20]  Jason Weston,et al.  Key-Value Memory Networks for Directly Reading Documents , 2016, EMNLP.

[21]  Tat-Seng Chua,et al.  Learning from Multiple Social Networks , 2016, Synthesis Lectures on Information Concepts, Retrieval, and Services.

[22]  Jiasen Lu,et al.  Hierarchical Question-Image Co-Attention for Visual Question Answering , 2016, NIPS.

[23]  David Vandyke,et al.  A Network-based End-to-End Trainable Task-oriented Dialogue System , 2016, EACL.

[24]  José M. F. Moura,et al.  Visual Dialog , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Meng Wang,et al.  Towards Micro-video Understanding by Joint Sequential-Sparse Modeling , 2017, ACM Multimedia.

[26]  Matthew R. Walter,et al.  Coherent Dialogue with Attention-Based Language Models , 2016, AAAI.

[27]  Jianfeng Gao,et al.  End-to-End Task-Completion Neural Dialogue Systems , 2017, IJCNLP.

[28]  Zhen Xu,et al.  Incorporating loose-structured knowledge into conversation modeling via recall-gate LSTM , 2016, 2017 International Joint Conference on Neural Networks (IJCNN).

[29]  Yash Goyal,et al.  Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Zhoujun Li,et al.  Sequential Match Network: A New Architecture for Multi-turn Response Selection in Retrieval-based Chatbots , 2016, ArXiv.

[31]  Hugo Larochelle,et al.  GuessWhat?! Visual Object Discovery through Multi-modal Dialogue , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Dongyan Zhao,et al.  Joint Learning of Response Ranking and Next Utterance Suggestion in Human-Computer Conversation System , 2017, SIGIR.

[33]  Jason Weston,et al.  Learning End-to-End Goal-Oriented Dialog , 2016, ICLR.

[34]  Jianfeng Gao,et al.  Towards End-to-End Reinforcement Learning of Dialogue Agents for Information Access , 2016, ACL.

[35]  Mitesh M. Khapra,et al.  Towards Building Large Scale Multimodal Domain-Aware Conversation Systems , 2017, AAAI.

[36]  Chong-Wah Ngo,et al.  Interpretable Multimodal Retrieval for Fashion Products , 2018, ACM Multimedia.

[37]  Qi Tian,et al.  Cross-modal Moment Localization in Videos , 2018, ACM Multimedia.

[38]  Yi Zhang,et al.  Conversational Recommender System , 2018, SIGIR.

[39]  Ming-Wei Chang,et al.  A Knowledge-Grounded Neural Conversation Model , 2017, AAAI.

[40]  Jun Huang,et al.  Response Ranking with Deep Matching Networks and External Knowledge in Information-seeking Conversation Systems , 2018, SIGIR.

[41]  Deng Cai,et al.  Dialogue Act Recognition via CRF-Attentive Structured Network , 2017, SIGIR.

[42]  Christopher Joseph Pal,et al.  Towards Deep Conversational Recommendations , 2018, NeurIPS.

[43]  Min-Yen Kan,et al.  Sequicity: Simplifying Task-oriented Dialogue Systems with Single Sequence-to-Sequence Architectures , 2018, ACL.

[44]  Tat-Seng Chua,et al.  Knowledge-aware Multimodal Dialogue Systems , 2018, ACM Multimedia.

[45]  Fumin Shen,et al.  Chat More: Deepening and Widening the Chatting Topic via A Deep Model , 2018, SIGIR.

[46]  Chen Cui,et al.  User Attention-guided Multimodal Dialog Systems , 2019, SIGIR.