Asymptotic Learnability of Reinforcement Problems with Arbitrary Dependence

We address the problem of reinforcement learning in which observations may exhibit an arbitrary form of stochastic dependence on past observations and actions, i.e. environments more general than (PO)MDPs. The task for an agent is to attain the best possible asymptotic reward when the true generating environment is unknown but belongs to a known countable family of environments. We identify sufficient conditions on the class of environments under which there exists an agent that attains the best asymptotic reward for every environment in the class. We analyze how tight these conditions are and how they relate to probabilistic assumptions common in reinforcement learning and related fields, such as Markov decision processes and mixing conditions.
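The objective stated above can be sketched formally. The notation below is illustrative (the symbols $\mu$, $p$, $r_i$, and $\mathcal{C}$ are assumptions of this sketch, not taken from the abstract): writing $r_1, r_2, \dots$ for the rewards obtained when policy $p$ acts in environment $\mu$, one natural notion of asymptotic reward is the upper average value

\[
V(\mu, p) \;=\; \limsup_{n \to \infty} \, \mathbb{E}_\mu^p \!\left[ \frac{1}{n} \sum_{i=1}^{n} r_i \right],
\]

and the learning task is then to find a single policy $p$ that, for a known countable class $\mathcal{C} = \{\mu_1, \mu_2, \dots\}$ containing the true (unknown) environment, satisfies

\[
V(\mu, p) \;=\; \sup_{p'} V(\mu, p') \quad \text{for every } \mu \in \mathcal{C}.
\]

Under this reading, the paper's contribution is a set of sufficient conditions on $\mathcal{C}$ guaranteeing that such a $p$ exists.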
