Guided Dialog Policy Learning: Reward Estimation for Multi-Domain Task-Oriented Dialog

Dialog policy decides what and how a task-oriented dialog system responds, and thus plays a vital role in delivering effective conversations. Many studies apply reinforcement learning to learn a dialog policy against a reward function, but such functions require elaborate design and pre-specified user goals. With the growing need to handle complex goals across multiple domains, manually designed reward functions cannot keep up with the complexity of real-world tasks. To this end, we propose Guided Dialog Policy Learning, a novel algorithm based on Adversarial Inverse Reinforcement Learning (AIRL) for joint reward estimation and policy optimization in multi-domain task-oriented dialog. The proposed approach estimates the reward signal and infers the user goal directly from dialog sessions: the reward estimator scores state-action pairs, so it can guide the dialog policy at every dialog turn rather than only at dialog completion. Extensive experiments on a multi-domain dialog dataset show that a dialog policy guided by the learned reward function achieves remarkably higher task success than state-of-the-art baselines.
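To make the joint estimation concrete, the sketch below illustrates one way an AIRL-style reward estimator can be trained adversarially on state-action pairs and queried for a per-turn reward. It is a minimal sketch, not the authors' implementation: the feature dimensions, network sizes, and helper names (RewardEstimator, discriminator_step, turn_reward) are illustrative assumptions, with states and actions taken to be fixed-size feature vectors.

```python
# Minimal AIRL-style sketch (assumed, not the authors' code): a reward
# estimator f(s, a) is trained as a discriminator between human (expert)
# and policy-generated state-action pairs, then queried per dialog turn.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 100, 30   # assumed feature sizes

class RewardEstimator(nn.Module):
    """Scores a state-action pair with a scalar f(s, a)."""
    def __init__(self):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, state, action):
        return self.f(torch.cat([state, action], dim=-1)).squeeze(-1)

def discriminator_step(est, opt, expert_s, expert_a, policy_s, policy_a):
    """One adversarial update: expert pairs labeled 1, policy pairs 0."""
    logits = torch.cat([est(expert_s, expert_a), est(policy_s, policy_a)])
    labels = torch.cat([torch.ones(len(expert_s)),
                        torch.zeros(len(policy_s))])
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

@torch.no_grad()
def turn_reward(est, state, action, log_pi):
    # AIRL-style reward f(s, a) - log pi(a|s), handed to the policy
    # optimizer (e.g., PPO) at every dialog turn.
    return est(state, action) - log_pi
```

In training, discriminator updates of this kind would alternate with policy-gradient updates on the estimated reward; the term f(s, a) - log pi(a|s) is the standard AIRL reward form, which supplies a dense turn-level signal instead of a sparse success reward at the end of the dialog.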
