Actor-Double-Critic: Incorporating Model-Based Critic for Task-Oriented Dialogue Systems

To improve the sample efficiency of deep reinforcement learning (DRL), we implement the imagination-augmented agent (I2A) in spoken dialogue systems (SDS). Although I2A achieves higher success rates than the baselines by augmenting the policy network with predicted future trajectories, its complicated architecture introduces unwanted instability. In this work, we propose the actor-double-critic (ADC) to improve the stability and overall performance of I2A. ADC simplifies the I2A architecture, removing excessive parameters and hyper-parameters. More importantly, its separate model-based critic shares parameters across actions and makes back-propagation explicit. In experiments on the Cambridge restaurant booking task, ADC improves success rates considerably and shows robustness to imperfect environment models. In addition, ADC exhibits greater stability and sample efficiency, significantly reducing the standard deviation of success rates relative to the baseline and reaching an 80% success rate with half the training data.
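To make the described layout concrete, below is a minimal sketch of how an actor, a model-free critic, and a model-based critic built on a shared learned environment model could be wired together. It is not the authors' implementation: the module names, dimensions, one-step rollout, and value-bootstrapping scheme are all assumptions for illustration only.

```python
# Hypothetical actor-double-critic layout (not the authors' code).
# Assumes a discrete dialogue-action space and a learned one-step
# environment model shared across all actions.
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Maps a belief state to a distribution over dialogue actions."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)


class ModelFreeCritic(nn.Module):
    """Standard state-value estimate V(s)."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state)


class ModelBasedCritic(nn.Module):
    """Evaluates every action through a single shared transition/reward model,
    so the critic loss back-propagates explicitly through the model."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.action_dim = action_dim
        # Shared environment model: (state, one-hot action) -> next state / reward.
        self.transition = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )
        self.reward = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
        self.value = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, gamma=0.99):
        # Roll each action through the shared model and bootstrap with V(s').
        q_values = []
        for a in range(self.action_dim):
            one_hot = torch.zeros(state.size(0), self.action_dim, device=state.device)
            one_hot[:, a] = 1.0
            sa = torch.cat([state, one_hot], dim=-1)
            next_state = self.transition(sa)
            q = self.reward(sa) + gamma * self.value(next_state)
            q_values.append(q)
        return torch.cat(q_values, dim=-1)  # (batch, action_dim)
```

Under these assumptions, the actor would be updated with an advantage signal that can draw on both critics, while the model-based critic's shared transition and reward networks keep the parameter count small compared with maintaining a separate rollout module per action, which is the simplification over I2A that the abstract describes.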
