Scalar Posterior Sampling with Applications

We propose a practical non-episodic PSRL algorithm that, unlike recent state-of-the-art PSRL algorithms, uses a deterministic, model-independent episode-switching schedule. Our algorithm, termed deterministic-schedule PSRL (DS-PSRL), is efficient in terms of time, sample, and space complexity. We prove a Bayesian regret bound under mild assumptions. Our result applies more generally to multi-parameter and continuous state-action problems. We compare our algorithm with state-of-the-art PSRL algorithms on standard discrete and continuous problems from the literature. Finally, we show how the assumptions of our algorithm are satisfied by a sensible parameterization for a large class of problems in sequential recommendations.
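To make the switching rule concrete, below is a minimal Python sketch of posterior sampling under a deterministic schedule. The `env`, `posterior`, and `planner` interfaces are hypothetical placeholders, and the doubling schedule (resampling at t = 1, 3, 7, 15, ...) is one illustrative deterministic choice, not necessarily the exact schedule analyzed in the paper.

```python
def ds_psrl(env, posterior, planner, horizon):
    """Posterior sampling RL with a deterministic episode-switching schedule.

    Sketch only: `posterior` is assumed to support .sample() and
    .update(s, a, s2), and `planner` is assumed to map a sampled model
    parameter to a policy. A doubling schedule stands in for the paper's
    deterministic, model-independent schedule.
    """
    state = env.reset()
    next_switch, episode_len = 1, 1
    policy = None
    for t in range(1, horizon + 1):
        if t == next_switch:                 # deterministic switch: no model-dependent test
            theta = posterior.sample()       # Thompson-style draw of the model parameter
            policy = planner(theta)          # solve the sampled model for its optimal policy
            episode_len *= 2                 # doubling lengths => O(log T) policy switches
            next_switch = t + episode_len
        action = policy(state)
        next_state, reward = env.step(action)
        posterior.update(state, action, next_state)  # condition on every transition
        state = next_state
```

Because the schedule is fixed in advance, the per-step work outside a switch is a single policy lookup and a posterior update, which is what makes the approach cheap in time and space; model-dependent switching tests, by contrast, must be re-evaluated at every step.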
