Natural Actor and Belief Critic: Reinforcement algorithm for learning parameters of dialogue systems modelled as POMDPs

This paper presents a novel algorithm for learning parameters in statistical dialogue systems which are modelled as Partially Observable Markov Decision Processes (POMDPs). The three main components of a POMDP dialogue manager are a dialogue model representing dialogue state information; a policy which selects the system's responses based on the inferred state; and a reward function which specifies the desired behaviour of the system. Ideally, both the model parameters and the policy would be designed to maximise the cumulative reward. However, whilst there are many techniques available for learning the optimal policy, no methods for learning the optimal model parameters that scale to real-world dialogue systems have yet been found. The presented algorithm, called the Natural Actor and Belief Critic (NABC), is a policy gradient method which offers a solution to this problem. Based on observed rewards, the algorithm estimates the natural gradient of the expected cumulative reward. The resulting gradient is then used to adapt both the prior distribution of the dialogue model parameters and the policy parameters. In addition, the paper presents a variant of the NABC algorithm, called the Natural Belief Critic (NBC), which assumes that the policy is fixed and only the model parameters need to be estimated. The algorithms are evaluated on a spoken dialogue system in the tourist information domain. The experiments show that model parameters estimated to maximise the expected cumulative reward result in significantly improved performance compared to the baseline handcrafted model parameters. The algorithms are also compared to optimization techniques using plain gradients and to state-of-the-art random search algorithms. In all cases, the algorithms based on the natural gradient perform significantly better.
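To make the core idea concrete, the following is a minimal sketch of an episodic natural-gradient policy update in the spirit of natural actor-critic methods: episode returns are regressed onto summed score functions (compatible features), and the resulting weight vector is used as an estimate of the natural gradient. The toy environment, feature map, softmax policy and all parameter names below are hypothetical stand-ins for illustration only; they are not the dialogue model or policy parameterisation used in the paper.

```python
import numpy as np

# Hypothetical toy set-up: a small discrete state/action space with a softmax policy.
rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, HORIZON = 4, 3, 10

def features(s, a):
    """One-hot feature vector for a (state, action) pair."""
    phi = np.zeros(N_STATES * N_ACTIONS)
    phi[s * N_ACTIONS + a] = 1.0
    return phi

def policy_probs(theta, s):
    """Softmax policy over actions in state s."""
    logits = np.array([theta @ features(s, a) for a in range(N_ACTIONS)])
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

def score(theta, s, a):
    """Score function grad log pi(a|s) for a softmax policy:
    phi(s, a) - E_pi[phi(s, .)]."""
    p = policy_probs(theta, s)
    return features(s, a) - sum(p[b] * features(s, b) for b in range(N_ACTIONS))

def run_episode(theta):
    """Toy episodic environment with an arbitrary reward; returns the summed
    score vector and the cumulative reward of the episode."""
    s, g, ret = rng.integers(N_STATES), np.zeros_like(theta), 0.0
    for _ in range(HORIZON):
        a = rng.choice(N_ACTIONS, p=policy_probs(theta, s))
        g += score(theta, s, a)
        ret += 1.0 if a == s % N_ACTIONS else 0.0  # arbitrary illustrative reward
        s = rng.integers(N_STATES)
    return g, ret

def natural_gradient_update(theta, n_episodes=200, step=0.1):
    """Estimate the natural gradient by least-squares regression of episode
    returns on the summed score functions, then take a step along it."""
    G = np.zeros((n_episodes, theta.size + 1))
    r = np.zeros(n_episodes)
    for i in range(n_episodes):
        g, ret = run_episode(theta)
        G[i, :-1], G[i, -1], r[i] = g, 1.0, ret  # last column absorbs a baseline term
    w, *_ = np.linalg.lstsq(G, r, rcond=None)
    return theta + step * w[:-1]                 # w[:-1] is the natural gradient estimate

theta = np.zeros(N_STATES * N_ACTIONS)
for _ in range(20):
    theta = natural_gradient_update(theta)
print("mean return after training:",
      np.mean([run_episode(theta)[1] for _ in range(100)]))
```

In NABC the same kind of natural-gradient estimate is applied not only to the policy parameters but also to the prior distribution over the dialogue model parameters; the sketch above only illustrates the policy-update half of that idea under the stated toy assumptions.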
