Rethinking Supervised Learning and Reinforcement Learning in Task-Oriented Dialogue Systems

Dialogue policy learning for task-oriented dialogue systems has recently made great progress, mostly by employing reinforcement learning (RL) methods. However, these methods have become increasingly sophisticated, and it is time to re-evaluate them: are we really making progress by developing dialogue agents based only on reinforcement learning? We demonstrate how (1) traditional supervised learning and (2) a simulator-free adversarial learning method can be used to achieve performance comparable to state-of-the-art RL-based methods. First, we introduce a simple dialogue action decoder to predict the appropriate actions. Then, we extend the traditional multi-label classification solution for dialogue policy learning by adding dense layers, which improves the dialogue agent's performance. Finally, we employ the Gumbel-Softmax estimator to train the dialogue agent and the dialogue reward model alternately, without using reinforcement learning. Based on our extensive experiments, we conclude that the proposed methods achieve more stable and higher performance with less effort, such as the domain knowledge required to design a user simulator and the often intractable hyperparameter tuning in reinforcement learning. Our main goal is not to beat reinforcement learning with supervised learning, but to demonstrate the value of rethinking the roles of reinforcement learning and supervised learning in optimizing task-oriented dialogue systems.
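To make the two ingredients concrete, here is a minimal sketch, not the authors' released code, of the multi-label classification view of dialogue policy learning: a dialogue-state feature vector passes through a few dense layers, and a sigmoid output independently predicts each dialogue action. All dimensions and names (`MultiLabelPolicy`, `state_dim`, `num_actions`) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiLabelPolicy(nn.Module):
    """Dialogue policy as a multi-label classifier over dialogue actions."""
    def __init__(self, state_dim=392, hidden_dim=256, num_actions=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),  # extra dense layer, as in the abstract
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, state):
        return self.net(state)  # raw logits, one per dialogue action

policy = MultiLabelPolicy()
criterion = nn.BCEWithLogitsLoss()                # standard multi-label objective
state = torch.randn(32, 392)                      # batch of dialogue-state features
target = torch.randint(0, 2, (32, 300)).float()   # gold action labels from dialogue corpora
loss = criterion(policy(state), target)
loss.backward()
```

And, continuing the same sketch, a hedged illustration of how the Gumbel-Softmax estimator lets a reward model (an adversarial discriminator over state-action pairs) send gradients back into the policy without policy-gradient RL. For simplicity this sketch treats the policy output as a single categorical action per turn; the reward model's architecture is also an assumption.

```python
import torch.nn.functional as F

# Reward model scoring (state, action) pairs; architecture is illustrative.
reward_model = nn.Sequential(
    nn.Linear(392 + 300, 256), nn.ReLU(), nn.Linear(256, 1)
)

logits = policy(state)
# Straight-through Gumbel-Softmax: a discrete one-hot action in the forward
# pass, a differentiable soft sample in the backward pass, so the reward
# model's gradient reaches the policy directly.
action_sample = F.gumbel_softmax(logits, tau=1.0, hard=True)
policy_loss = -reward_model(torch.cat([state, action_sample], dim=-1)).mean()
policy_loss.backward()
```

Alternating this adversarial step with the supervised multi-label step above is what replaces the usual user-simulator-plus-RL loop in this framing.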
