Universal Reinforcement Learning Algorithms: Survey and Experiments

State-of-the-art reinforcement learning (RL) algorithms typically assume that the environment is an ergodic Markov decision process (MDP). In contrast, the field of universal reinforcement learning (URL) is concerned with algorithms that make as few assumptions as possible about the environment. The universal Bayesian agent AIXI and a family of related URL algorithms have been developed in this setting. While numerous theoretical optimality results have been proven for these agents, to date there has been no empirical investigation of their behavior. We present a short and accessible survey of these URL algorithms under a unified notation and framework, together with experiments that qualitatively illustrate properties of the resulting policies and their relative performance on partially observable gridworld environments. We also present an open-source reference implementation of the algorithms, which we hope will facilitate further understanding of, and experimentation with, these ideas.
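For readers unfamiliar with AIXI, the following is a standard statement of its action selection (following Hutter's formulation; the notation here is supplied for context and is not taken from the abstract). AIXI plans by finite-horizon expectimax over a Bayesian mixture of all lower semicomputable chronological semimeasures $\mathcal{M}$, weighted by a Kolmogorov-complexity prior:

$$
a_t \;=\; \arg\max_{a_t} \sum_{e_t} \;\cdots\; \max_{a_{t+m}} \sum_{e_{t+m}} \Big[\textstyle\sum_{k=t}^{t+m} r_k\Big] \sum_{\nu \in \mathcal{M}} 2^{-K(\nu)}\, \nu\big(e_{1:t+m} \,\big\|\, a_{1:t+m}\big),
$$

where $e_k = (o_k, r_k)$ is the percept (observation and reward) at time $k$, $m$ is the planning horizon, $K(\nu)$ is the Kolmogorov complexity of the environment model $\nu$, and $\nu(e_{1:t+m} \,\|\, a_{1:t+m})$ denotes the probability $\nu$ assigns to the percept sequence given the action sequence. AIXI is incomputable, which is why the approximations and variants surveyed here matter in practice.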
