Improving Sample-Efficiency in Reinforcement Learning for Dialogue Systems by Using Trainable-Action-Mask

By interacting with humans and learning from reward signals, reinforcement learning is an ideal way to build conversational AI. Given the expense of collecting real users' responses, improving sample efficiency has been the key issue when applying reinforcement learning to real-world spoken dialogue systems (SDS). Handcrafted action masks are commonly used to rule out impossible actions and accelerate training. However, handcrafted action masks can hardly be generalized to unseen domains. In this paper, we propose the trainable action mask (TAM), which learns from data automatically without handcrafting complicated rules. In our experiments in the Cambridge Restaurant domain, TAM requires only 30% of the training data, compared with the baseline, to reach an 80% success rate, and it also shows robustness to noisy environments.
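To make the idea concrete, below is a minimal sketch of how a trainable action mask can gate a dialogue policy's action distribution. The architecture here is an assumption for illustration only (the paper's exact model is not specified in this abstract): a small mask head predicts, from the belief state, how plausible each system action is, and adding the log of that mask to the policy logits drives implausible actions toward near-zero probability. The class name MaskedPolicy and all dimensions are hypothetical.

import torch
import torch.nn as nn

class MaskedPolicy(nn.Module):
    """Illustrative policy with a learned action mask (TAM-style sketch).

    NOTE: this is an assumed architecture, not the paper's exact model.
    """

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        # Standard policy head: belief state -> action logits.
        self.policy_head = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )
        # Trainable mask head: learned from data rather than handcrafted rules.
        self.mask_head = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, belief_state: torch.Tensor) -> torch.Tensor:
        logits = self.policy_head(belief_state)
        mask = torch.sigmoid(self.mask_head(belief_state))  # values in (0, 1)
        # log(mask) pushes the logits of masked-out actions toward -inf,
        # so the softmax assigns them near-zero probability.
        masked_logits = logits + torch.log(mask + 1e-8)
        return torch.softmax(masked_logits, dim=-1)

# Usage sketch: sample actions for a batch of belief states
# (dimensions are arbitrary placeholders).
policy = MaskedPolicy(state_dim=64, num_actions=16)
probs = policy(torch.randn(4, 64))
actions = torch.distributions.Categorical(probs).sample()

Because the mask head is differentiable, it can be trained jointly with the policy from interaction data, which is what lets this approach transfer to unseen domains where handcrafted rules are unavailable.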
