Approximate Bayes Optimal Policy Search using Neural Networks

Bayesian Reinforcement Learning (BRL) agents aim to maximise the expected rewards collected while interacting with an unknown Markov Decision Process (MDP), exploiting some prior knowledge about it. State-of-the-art BRL agents rely on frequent updates of the belief over the MDP as new observations of the environment are made. Such updates offer theoretical guarantees of convergence to an optimum, but they are computationally intractable, even on small-scale problems. In this paper, we present a method that circumvents this issue by training a parametric policy that recommends an action directly from raw observations. Artificial Neural Networks (ANNs) represent this policy and are trained offline on trajectories sampled from the prior. The trained model is then used online and can act on the real MDP at a very low computational cost. Our algorithm shows strong empirical performance on a wide range of test problems and is robust to inaccuracies of the prior distribution.
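To make the offline-training / online-acting split concrete, the following is a minimal sketch in Python. The helper names sample_mdp_from_prior and simulate_episode, the network architecture, and the supervised cross-entropy objective are assumptions introduced purely for illustration; the paper's actual training procedure for the ANN policy may differ.

```python
# Sketch: train a neural policy offline on trajectories drawn from the prior,
# then use it online at low per-step cost. Hypothetical helpers, not the
# authors' exact implementation.
import numpy as np
import torch
import torch.nn as nn


class PolicyNet(nn.Module):
    """Small feed-forward policy mapping raw observations to action logits."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)


def train_prior_based_policy(sample_mdp_from_prior, simulate_episode,
                             obs_dim, n_actions, n_mdps=1000,
                             epochs=10, lr=1e-3):
    # Offline phase: draw MDPs from the prior and collect (observation, action)
    # pairs; simulate_episode is assumed to return such pairs, e.g. actions
    # recommended by a planner that has access to the sampled MDP.
    dataset = []
    for _ in range(n_mdps):
        mdp = sample_mdp_from_prior()
        dataset.extend(simulate_episode(mdp))

    obs = torch.tensor(np.array([o for o, _ in dataset]), dtype=torch.float32)
    acts = torch.tensor([a for _, a in dataset], dtype=torch.long)

    policy = PolicyNet(obs_dim, n_actions)
    optimiser = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        optimiser.zero_grad()
        loss = loss_fn(policy(obs), acts)
        loss.backward()
        optimiser.step()

    return policy
```

Online, no belief update is performed: the trained network is simply evaluated on the current observation (e.g. action = policy(obs).argmax()), which is what keeps the per-step computational cost low.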
