Bayesian Optimal Control of Smoothly Parameterized Systems

We study Bayesian optimal control for a general class of smoothly parameterized Markov decision problems (MDPs). We propose a lazy version of the so-called posterior sampling method, a method that goes back to Thompson and Strens and was more recently studied by Osband, Russo, and Van Roy. While Osband et al. derived a bound on the (Bayesian) regret of this method for undiscounted, total-cost, episodic problems with finite state and action spaces, we consider the continuing, average-cost setting with no cardinality restrictions on the state or action spaces. Whereas in the episodic setting it is natural to switch to a new policy at episode ends, in the continuing average-cost framework switching points must be introduced explicitly and in a principled fashion, otherwise the regret could grow linearly. Our lazy method introduces these switching points by monitoring the uncertainty that remains about the unknown parameter. To obtain a suitable and easy-to-compute uncertainty measure, we introduce a new "average local smoothness" condition, which we show is satisfied in common examples. Under this and some additional mild conditions, we derive rate-optimal bounds on the regret of our algorithm. Our general approach allows us to use a single algorithm and a single analysis for a wide range of problems, such as finite MDPs and linear quadratic regulation, both of which are instances of smoothly parameterized MDPs. The effectiveness of our method is illustrated by a simulated example.
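The lazy switching mechanism is perhaps easiest to see in a linear-Gaussian special case, where the posterior over the unknown parameter stays Gaussian and its covariance gives an easy-to-compute uncertainty measure. The following is a minimal sketch under that assumption; the helper names (`simulate_step`, `solve_policy`, `features`) are hypothetical placeholders for the environment, the planner, and the feature map, and the "determinant halved since the last switch" test is one illustrative choice of uncertainty monitor, not necessarily the paper's exact condition.

```python
import numpy as np

def lazy_posterior_sampling(prior_mean, prior_cov, noise_var, horizon,
                            simulate_step, solve_policy, features, x0):
    """Sketch of a lazy posterior-sampling control loop, assuming a
    linear-Gaussian model y = <theta, features(x, a)> + noise, so the
    posterior over theta is Gaussian and its covariance determinant can
    serve as the monitored uncertainty measure."""
    precision = np.linalg.inv(prior_cov)      # posterior precision matrix
    b = precision @ prior_mean                # precision-weighted mean
    cov, mean = prior_cov.copy(), prior_mean.copy()
    det_at_switch = np.linalg.det(cov)

    theta = np.random.multivariate_normal(mean, cov)   # posterior sample
    policy = solve_policy(theta)                        # plan for the sampled model
    x = x0

    for t in range(horizon):
        # Lazy switching: re-sample and re-plan only once the posterior
        # uncertainty (here: covariance determinant) has halved since the
        # last switch, so the number of policy changes stays small.
        if np.linalg.det(cov) < 0.5 * det_at_switch:
            theta = np.random.multivariate_normal(mean, cov)
            policy = solve_policy(theta)
            det_at_switch = np.linalg.det(cov)

        a = policy(x)
        phi = features(x, a)
        x_next, y = simulate_step(x, a)       # observe next state and response

        # Standard conjugate update for Bayesian linear regression.
        precision += np.outer(phi, phi) / noise_var
        b += phi * y / noise_var
        cov = np.linalg.inv(precision)
        mean = cov @ b

        x = x_next

    return mean, cov
```

The design point this sketch illustrates is that the policy is recomputed only at data-driven switching times rather than at every step, which is what keeps both the planning cost and the regret contribution from policy switches under control in the continuing, average-cost setting.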

[1] W. R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. 1933.

[2] M. K. Ghosh et al. Discrete-time controlled Markov processes with average cost criterion: a survey. 1993.

[3] Malcolm J. A. Strens et al. A Bayesian Framework for Reinforcement Learning. ICML, 2000.

[4] Benjamin Van Roy et al. Approximate Linear Programming for Average-Cost Dynamic Programming. NIPS, 2002.

[5] Yixin Diao et al. Feedback Control of Computing Systems. 2004.

[6] Peter Auer et al. Near-optimal Regret Bounds for Reinforcement Learning. J. Mach. Learn. Res., 2008.

[7] Nahum Shimkin et al. Nonlinear Control Systems. 2008.

[8] Andrew Y. Ng et al. Near-Bayesian Exploration in Polynomial Time. ICML, 2009.

[9] Lihong Li et al. A Bayesian Sampling Approach to Exploration in Reinforcement Learning. UAI, 2009.

[10] Pascal Poupart et al. Bayesian Reinforcement Learning. Encyclopedia of Machine Learning, 2010.

[11] Csaba Szepesvári et al. Regret Bounds for the Adaptive Control of Linear Quadratic Systems. COLT, 2011.

[12] Olivier Buffet et al. Near-Optimal BRL using Optimistic Local Transitions. ICML, 2012.

[13] Shie Mannor et al. Bayesian Reinforcement Learning. In Reinforcement Learning, 2012.

[14] Peter Vrancx et al. Reinforcement Learning: State-of-the-Art. 2012.

[15] Csaba Szepesvári et al. Online learning for linearly parametrized control problems. 2012.

[16] Benjamin Van Roy et al. (More) Efficient Reinforcement Learning via Posterior Sampling. NIPS, 2013.

[17] Peter Dayan et al. Scalable and Efficient Bayes-Adaptive Reinforcement Learning Based on Monte-Carlo Tree Search. J. Artif. Intell. Res., 2013.

[18] Benjamin Van Roy et al. Model-based Reinforcement Learning and the Eluder Dimension. NIPS, 2014.

[19] Peter Dayan et al. Better Optimism By Bayes: Adaptive Planning with Rich Models. arXiv, 2014.

[20] Shie Mannor et al. Thompson Sampling for Learning Parameterized Markov Decision Processes. COLT, 2014.

[21] M. Hoagland et al. Feedback Systems: An Introduction for Scientists and Engineers, Second Edition. 2015.