Alternating Optimisation and Quadrature for Robust Reinforcement Learning

Bayesian optimisation has been successfully applied to a variety of reinforcement learning problems. However, the traditional approach for learning optimal policies in simulators does not utilise the opportunity to improve learning by adjusting certain environment variables: state features that are randomly determined by the environment in a physical setting but are controllable in a simulator. This paper considers the problem of finding an optimal policy while taking into account the impact of environment variables. We present alternating optimisation and quadrature (ALOQ), which uses Bayesian optimisation and Bayesian quadrature to address such settings. ALOQ is robust to the presence of significant rare events, which may not be observable under random sampling but have a considerable impact on determining the optimal policy. We provide experimental results demonstrating that our approach learns more efficiently than existing methods.
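
The sketch below illustrates one plausible reading of the alternating scheme described above: a Gaussian process is fitted over the joint space of policy parameters and the environment variable, a quadrature-style step marginalises the GP over the environment distribution to obtain the expected return per policy, and an optimisation step picks the next policy via an upper confidence bound. The toy return function, the Beta distribution for the environment variable, the scikit-learn GP, and the UCB/variance-based selection rules are all illustrative assumptions, not the authors' exact formulation.

```python
# Minimal ALOQ-style sketch (assumed, not the paper's implementation):
# 1-D policy parameter `theta`, 1-D environment variable `env`.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

def toy_return(theta, env):
    """Hypothetical return: good near theta=0.3 unless a rare env value hits."""
    base = np.exp(-((theta - 0.3) ** 2) / 0.02)
    penalty = 5.0 * (env > 0.95)          # significant rare event in the tail
    return base - penalty

def sample_env(n):
    """Assumed environment-variable distribution p(env); rare events in its tail."""
    return rng.beta(2.0, 5.0, size=n)

# Initial design over the joint (theta, env) space.
X = rng.uniform(size=(10, 2))             # columns: theta, env
y = toy_return(X[:, 0], X[:, 1])

kernel = RBF(length_scale=[0.2, 0.2]) + WhiteKernel(noise_level=1e-3)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)

theta_grid = np.linspace(0.0, 1.0, 101)
env_nodes = sample_env(200)               # quadrature nodes drawn from p(env)

for it in range(20):
    gp.fit(X, y)

    # Quadrature step: average the GP over p(env) to estimate the expected
    # return per theta (mean predictive std is used as a crude uncertainty proxy).
    joint = np.column_stack([
        np.repeat(theta_grid, len(env_nodes)),
        np.tile(env_nodes, len(theta_grid)),
    ])
    mu, sd = gp.predict(joint, return_std=True)
    mu = mu.reshape(len(theta_grid), -1).mean(axis=1)
    sd = sd.reshape(len(theta_grid), -1).mean(axis=1)

    # Optimisation step: UCB over the marginalised objective picks the next theta.
    theta_next = theta_grid[np.argmax(mu + 2.0 * sd)]

    # Pick the env value where the GP is most uncertain at that theta, so rare
    # but impactful settings are still evaluated rather than left to chance.
    cand = np.column_stack([np.full_like(env_nodes, theta_next), env_nodes])
    _, cand_sd = gp.predict(cand, return_std=True)
    env_next = env_nodes[np.argmax(cand_sd)]

    X = np.vstack([X, [theta_next, env_next]])
    y = np.append(y, toy_return(theta_next, env_next))

best_theta = theta_grid[np.argmax(mu)]
print(f"estimated robust optimum: theta ~ {best_theta:.2f}")
```

In this toy setup the marginalised objective penalises policies whose high return depends on the rare event not occurring, which is the kind of robustness the abstract attributes to ALOQ.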
