Distributed Structured Actor-Critic Reinforcement Learning for Universal Dialogue Management

Traditionally, a dialogue policy has to be trained independently for each dialogue task. In this work, we aim to solve a collection of independent dialogue tasks with a single unified dialogue agent, whose policy is trained in parallel on conversation data from all of the distributed dialogue tasks. This raises two key challenges: (1) designing a unified dialogue model that can adapt to different dialogue tasks; and (2) finding a robust reinforcement learning method that keeps the training process efficient and stable. We propose a novel structured actor-critic approach to implement structured deep reinforcement learning (DRL), which can not only learn in parallel from the data of different dialogue tasks but also achieve stable and sample-efficient learning. We demonstrate the effectiveness of the proposed approach on 18 tasks of the PyDial benchmark, where it achieves state-of-the-art performance.
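To make the training setup concrete, below is a minimal sketch (not the paper's implementation) of an advantage actor-critic update for a single policy shared across several dialogue tasks. The structured (graph-based) policy network, the distributed workers, and any off-policy corrections described in the paper are omitted; the names SharedPolicy, a2c_update, BELIEF_DIM, and ACTION_DIM, as well as the random toy data, are hypothetical placeholders.

```python
# Minimal sketch, assuming a flat belief-state vector per task and synchronous
# updates on batches gathered from all tasks; this is an illustration of the
# shared actor-critic idea, not the paper's structured or distributed method.
import torch
import torch.nn as nn

BELIEF_DIM, ACTION_DIM = 32, 8  # hypothetical sizes

class SharedPolicy(nn.Module):
    """One actor-critic network reused for every dialogue task."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(BELIEF_DIM, 64), nn.ReLU())
        self.actor = nn.Linear(64, ACTION_DIM)   # action logits
        self.critic = nn.Linear(64, 1)           # state-value estimate

    def forward(self, belief):
        h = self.body(belief)
        return self.actor(h), self.critic(h).squeeze(-1)

def a2c_update(policy, opt, beliefs, actions, returns):
    """One update on a batch of transitions pooled from all tasks."""
    logits, values = policy(beliefs)
    log_prob = torch.distributions.Categorical(logits=logits).log_prob(actions)
    advantage = returns - values.detach()
    actor_loss = -(log_prob * advantage).mean()
    critic_loss = (returns - values).pow(2).mean()
    opt.zero_grad()
    (actor_loss + 0.5 * critic_loss).backward()
    opt.step()

# Toy usage: pretend these transitions were collected in parallel from two tasks.
policy = SharedPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(10):
    beliefs = torch.randn(16, BELIEF_DIM)          # belief states from both tasks
    actions = torch.randint(0, ACTION_DIM, (16,))  # actions chosen by the workers
    returns = torch.randn(16)                      # placeholder Monte-Carlo returns
    a2c_update(policy, opt, beliefs, actions, returns)
```

The design point the sketch conveys is that all tasks update the same parameter set, so experience from any one dialogue domain improves the policy used for the others.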
