Adversarial Advantage Actor-Critic Model for Task-Completion Dialogue Policy Learning

This paper presents a new method, adversarial advantage actor-critic (Adversarial A2C), which significantly improves the efficiency of dialogue policy learning in task-completion dialogue systems. Inspired by generative adversarial networks (GANs), we train a discriminator to differentiate the responses/actions generated by the dialogue agent from those of human experts. We then incorporate the discriminator as a second critic into the advantage actor-critic (A2C) framework, encouraging the agent to explore the state-action space in regions where its actions resemble those of the experts. Experimental results in a movie-ticket booking domain show that the proposed Adversarial A2C substantially accelerates policy exploration.
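To make the method concrete, below is a minimal, illustrative PyTorch sketch of one Adversarial A2C update. It assumes discrete dialogue actions, one-step returns, and a GAIL-style log D(s, a) bonus mixed into the advantage; the network sizes, the mixing weight `lam`, and the loss weighting are hypothetical and not the paper's exact configuration.

```python
# Illustrative sketch only: discrete actions, one-step returns, and a
# GAIL-style log D(s, a) bonus mixed into the advantage. Sizes and the
# mixing weight `lam` are hypothetical, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS, HIDDEN = 16, 8, 64

class ActorCritic(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(STATE_DIM, HIDDEN), nn.Tanh())
        self.pi = nn.Linear(HIDDEN, N_ACTIONS)  # policy head
        self.v = nn.Linear(HIDDEN, 1)           # value head (first critic)

    def forward(self, s):
        h = self.body(s)
        return F.log_softmax(self.pi(h), dim=-1), self.v(h).squeeze(-1)

class Discriminator(nn.Module):
    """Second critic: scores (state, action) pairs, ~1 expert, ~0 agent."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + N_ACTIONS, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, 1))

    def forward(self, s, a_onehot):
        return torch.sigmoid(self.net(torch.cat([s, a_onehot], -1))).squeeze(-1)

def adversarial_a2c_update(ac, disc, ac_opt, d_opt,
                           batch, expert_batch, gamma=0.99, lam=0.5):
    s, a, r, s_next, done = batch   # agent rollout (float tensors; a is long)
    es, ea = expert_batch           # expert state-action pairs
    a_1h = F.one_hot(a, N_ACTIONS).float()
    ea_1h = F.one_hot(ea, N_ACTIONS).float()

    # 1) Discriminator step: push expert pairs toward 1, agent pairs toward 0.
    d_loss = (F.binary_cross_entropy(disc(es, ea_1h), torch.ones(len(es)))
              + F.binary_cross_entropy(disc(s, a_1h), torch.zeros(len(s))))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) A2C step with the discriminator as a second critic: its log-score
    #    is added to the one-step advantage, so expert-like actions are
    #    reinforced and exploration concentrates in expert-like regions.
    logp, v = ac(s)
    with torch.no_grad():
        _, v_next = ac(s_next)
        target = r + gamma * (1.0 - done) * v_next
        adv = target - v + lam * torch.log(disc(s, a_1h) + 1e-8)
    pi_loss = -(logp.gather(1, a.unsqueeze(1)).squeeze(1) * adv).mean()
    v_loss = F.mse_loss(v, target)
    ac_opt.zero_grad(); (pi_loss + 0.5 * v_loss).backward(); ac_opt.step()
```

The key point is step 2: the discriminator's score raises the advantage of actions that resemble expert behavior, which is what steers exploration toward expert-like regions of the state-action space.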

[1] Demis Hassabis, et al. Mastering the game of Go with deep neural networks and tree search, 2016, Nature.

[2] David Vandyke, et al. Continuously Learning Neural Dialogue Management, 2016, ArXiv.

[3] Jianfeng Gao, et al. End-to-End Task-Completion Neural Dialogue Systems, 2017, IJCNLP.

[4] Jianfeng Gao, et al. BBQ-Networks: Efficient Exploration in Deep Reinforcement Learning for Task-Oriented Dialogue Systems, 2018, AAAI.

[5] Maxine Eskénazi, et al. Towards End-to-End Learning for Dialog State Tracking and Management using Deep Reinforcement Learning, 2016, SIGDIAL Conference.

[6] Tom Schaul, et al. Reinforcement Learning with Unsupervised Auxiliary Tasks, 2017, ICLR.

[7] Demis Hassabis, et al. Mastering the game of Go without human knowledge, 2017, Nature.

[8] Jianfeng Gao, et al. A User Simulator for Task-Completion Dialogues, 2016, ArXiv.

[9] Milica Gasic, et al. POMDP-Based Statistical Spoken Dialog Systems: A Review, 2013, Proceedings of the IEEE.

[10] Nuttapong Chentanez, et al. Intrinsically Motivated Reinforcement Learning, 2004, NIPS.

[11] Yoshua Bengio, et al. Generative Adversarial Nets, 2014, NIPS.

[12] Shakir Mohamed, et al. Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning, 2015, NIPS.

[13] David Vandyke, et al. Reward Shaping with Recurrent Neural Networks for Speeding up On-Line Policy Learning in Spoken Dialogue Systems, 2015, SIGDIAL Conference.

[14] R. J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, 1992, Machine Learning.

[15] Kam-Fai Wong, et al. Composite Task-Completion Dialogue Policy Learning via Hierarchical Deep Reinforcement Learning, 2017, EMNLP.

[16] Zachary Chase Lipton, et al. Efficient Exploration for Dialogue Policy Learning with BBQ Networks & Replay Buffer Spiking, 2016, ArXiv.

[17] Alex Graves, et al. Asynchronous Methods for Deep Reinforcement Learning, 2016, ICML.

[18] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[19] Geoffrey Zweig, et al. Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning, 2017, ACL.

[20] Jianfeng Gao, et al. Towards End-to-End Reinforcement Learning of Dialogue Agents for Information Access, 2017, ACL.

[21] Kam-Fai Wong, et al. Integrating Planning for Task-Completion Dialogue Policy Learning, 2018, ACL.

[22] Stefano Ermon, et al. Generative Adversarial Imitation Learning, 2016, NIPS.

[23] Kam-Fai Wong, et al. Composite Task-Completion Dialogue System via Hierarchical Deep Reinforcement Learning, 2017, ArXiv.

[24] Filip De Turck, et al. VIME: Variational Information Maximizing Exploration, 2016, NIPS.

[25] Jing He, et al. Policy Networks with Two-Stage Training for Dialogue Systems, 2016, SIGDIAL Conference.

[26] Andrew Y. Ng, et al. Algorithms for Inverse Reinforcement Learning, 2000, ICML.

[27] Pieter Abbeel, et al. Apprenticeship learning via inverse reinforcement learning, 2004, ICML.