Distributed Structured Actor-Critic Reinforcement Learning for Universal Dialogue Management

Traditionally, a dialogue policy has to be trained independently for each dialogue task. In this work, we aim to solve a collection of independent dialogue tasks with a single unified dialogue agent, whose policy is trained in parallel on conversation data from all of the distributed dialogue tasks. This raises two key challenges: (1) designing a unified dialogue model that can adapt to different dialogue tasks; and (2) finding a robust reinforcement learning method that keeps the training process efficient and stable. We propose a novel structured actor-critic approach to implement structured deep reinforcement learning (DRL), which can not only learn in parallel from the data of different dialogue tasks but also achieve stable and sample-efficient learning. We demonstrate the effectiveness of the proposed approach on 18 tasks of the PyDial benchmark, where it achieves state-of-the-art performance.
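To make the training setup concrete, below is a minimal sketch (not the paper's implementation) of an advantage actor-critic update for a single policy shared across several dialogue tasks. The structured (graph-based) policy network, the distributed workers, and any off-policy corrections described in the paper are omitted; the names SharedPolicy, a2c_update, BELIEF_DIM, and ACTION_DIM, as well as the random toy data, are hypothetical placeholders.

```python
# Minimal sketch, assuming a flat belief-state vector per task and synchronous
# updates on batches gathered from all tasks; this is an illustration of the
# shared actor-critic idea, not the paper's structured or distributed method.
import torch
import torch.nn as nn

BELIEF_DIM, ACTION_DIM = 32, 8  # hypothetical sizes

class SharedPolicy(nn.Module):
    """One actor-critic network reused for every dialogue task."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(BELIEF_DIM, 64), nn.ReLU())
        self.actor = nn.Linear(64, ACTION_DIM)   # action logits
        self.critic = nn.Linear(64, 1)           # state-value estimate

    def forward(self, belief):
        h = self.body(belief)
        return self.actor(h), self.critic(h).squeeze(-1)

def a2c_update(policy, opt, beliefs, actions, returns):
    """One update on a batch of transitions pooled from all tasks."""
    logits, values = policy(beliefs)
    log_prob = torch.distributions.Categorical(logits=logits).log_prob(actions)
    advantage = returns - values.detach()
    actor_loss = -(log_prob * advantage).mean()
    critic_loss = (returns - values).pow(2).mean()
    opt.zero_grad()
    (actor_loss + 0.5 * critic_loss).backward()
    opt.step()

# Toy usage: pretend these transitions were collected in parallel from two tasks.
policy = SharedPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(10):
    beliefs = torch.randn(16, BELIEF_DIM)          # belief states from both tasks
    actions = torch.randint(0, ACTION_DIM, (16,))  # actions chosen by the workers
    returns = torch.randn(16)                      # placeholder Monte-Carlo returns
    a2c_update(policy, opt, beliefs, actions, returns)
```

The design point the sketch conveys is that all tasks update the same parameter set, so experience from any one dialogue domain improves the policy used for the others.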
