Guided Dialog Policy Learning: Reward Estimation for Multi-Domain Task-Oriented Dialog

Dialog policy decides what and how a task-oriented dialog system responds, and thus plays a vital role in delivering effective conversations. Many studies apply reinforcement learning to learn a dialog policy against a reward function, but such functions require elaborate design and pre-specified user goals. With the growing need to handle complex goals across multiple domains, manually designed reward functions cannot keep up with the complexity of real-world tasks. To this end, we propose Guided Dialog Policy Learning, a novel algorithm based on Adversarial Inverse Reinforcement Learning (AIRL) for joint reward estimation and policy optimization in multi-domain task-oriented dialog. The proposed approach estimates the reward signal and infers the user goal directly from dialog sessions: the reward estimator scores state-action pairs, so it can guide the dialog policy at every dialog turn rather than only at dialog completion. Extensive experiments on a multi-domain dialog dataset show that a dialog policy guided by the learned reward function achieves remarkably higher task success than state-of-the-art baselines.
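To make the joint estimation concrete, the sketch below illustrates one way an AIRL-style reward estimator can be trained adversarially on state-action pairs and queried for a per-turn reward. It is a minimal sketch, not the authors' implementation: the feature dimensions, network sizes, and helper names (RewardEstimator, discriminator_step, turn_reward) are illustrative assumptions, with states and actions taken to be fixed-size feature vectors.

```python
# Minimal AIRL-style sketch (assumed, not the authors' code): a reward
# estimator f(s, a) is trained as a discriminator between human (expert)
# and policy-generated state-action pairs, then queried per dialog turn.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 100, 30   # assumed feature sizes

class RewardEstimator(nn.Module):
    """Scores a state-action pair with a scalar f(s, a)."""
    def __init__(self):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, state, action):
        return self.f(torch.cat([state, action], dim=-1)).squeeze(-1)

def discriminator_step(est, opt, expert_s, expert_a, policy_s, policy_a):
    """One adversarial update: expert pairs labeled 1, policy pairs 0."""
    logits = torch.cat([est(expert_s, expert_a), est(policy_s, policy_a)])
    labels = torch.cat([torch.ones(len(expert_s)),
                        torch.zeros(len(policy_s))])
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

@torch.no_grad()
def turn_reward(est, state, action, log_pi):
    # AIRL-style reward f(s, a) - log pi(a|s), handed to the policy
    # optimizer (e.g., PPO) at every dialog turn.
    return est(state, action) - log_pi
```

In training, discriminator updates of this kind would alternate with policy-gradient updates on the estimated reward; the term f(s, a) - log pi(a|s) is the standard AIRL reward form, which supplies a dense turn-level signal instead of a sparse success reward at the end of the dialog.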
