Making Sense of Reinforcement Learning and Probabilistic Inference

Reinforcement learning (RL) combines a control problem with statistical estimation: the system dynamics are not known to the agent, but can be learned through experience. A recent line of research casts 'RL as inference' and suggests a particular framework for generalizing the RL problem as probabilistic inference. Our paper surfaces a key shortcoming in that approach and clarifies the sense in which RL can be coherently cast as an inference problem. In particular, an RL agent must consider the effects of its actions upon future rewards and observations: the exploration-exploitation tradeoff. In all but the simplest settings, the resulting inference is computationally intractable, so practical RL algorithms must resort to approximation. We demonstrate that the popular 'RL as inference' approximation can perform poorly even in very basic problems. However, we show that with a small modification the framework does yield algorithms that can provably perform well, and that the resulting algorithm is equivalent to the recently proposed K-learning, which we further connect with Thompson sampling.
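
To make the exploration issue concrete, here is a minimal, illustrative sketch (not taken from the paper): on a two-armed Bernoulli bandit, a softmax/Boltzmann policy built on point estimates of the mean reward, in the spirit of common 'RL as inference' approximations, keeps pulling the worse arm at a roughly constant rate because it ignores epistemic uncertainty, while Thompson sampling acts on posterior samples and concentrates on the better arm. The problem instance, temperature, and all variable names are assumptions made for this example.

```python
# Illustrative comparison on a two-armed Bernoulli bandit (assumed instance).
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.6, 0.5])   # arm 0 is best, but only slightly
T = 2000                            # number of interaction steps

def run_boltzmann(temperature=0.1):
    """Softmax over point estimates of each arm's mean (no posterior)."""
    counts = np.ones(2)
    successes = np.full(2, 0.5)     # crude prior so estimates are defined
    regret = 0.0
    for _ in range(T):
        q_hat = successes / counts              # point estimate per arm
        probs = np.exp(q_hat / temperature)
        probs /= probs.sum()                    # Boltzmann action distribution
        a = rng.choice(2, p=probs)
        r = rng.random() < true_means[a]
        counts[a] += 1
        successes[a] += r
        regret += true_means.max() - true_means[a]
    return regret

def run_thompson():
    """Beta-Bernoulli Thompson sampling: act greedily on a posterior sample."""
    alpha = np.ones(2)
    beta = np.ones(2)
    regret = 0.0
    for _ in range(T):
        theta = rng.beta(alpha, beta)           # sample a mean for each arm
        a = int(np.argmax(theta))               # greedy w.r.t. the sample
        r = rng.random() < true_means[a]
        alpha[a] += r
        beta[a] += 1 - r
        regret += true_means.max() - true_means[a]
    return regret

print("Boltzmann (point-estimate) regret:", run_boltzmann())
print("Thompson sampling regret:        ", run_thompson())
```

In this sketch the Boltzmann policy's regret grows roughly linearly in T (it dithers between the arms forever), whereas Thompson sampling's regret levels off; this mirrors, in a toy setting, the kind of failure the paper attributes to the unmodified 'RL as inference' approximation.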
