On the Convergence of Bounded Agents

When has an agent converged? Standard models of the reinforcement learning problem give rise to a straightforward definition of convergence: An agent converges when its behavior or performance in each environment state stops changing. However, as we shift the focus of our learning problem from the environment's state to the agent's state, the concept of an agent's convergence becomes significantly less clear. In this paper, we propose two complementary accounts of agent convergence in a framing of the reinforcement learning problem that centers on bounded agents. The first view says that a bounded agent has converged when the minimal number of states needed to describe the agent's future behavior cannot decrease. The second view says that a bounded agent has converged just when the agent's performance changes only if the agent's internal state changes. We establish basic properties of these two definitions, show that they accommodate typical views of convergence in standard settings, and prove several facts about their nature and relationship. We take these perspectives, definitions, and analyses to bring clarity to a central idea of the field.
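To make the two views concrete, the following is a minimal sketch of how they might be formalized; the notation (an agent-state space S, an update function u, and the minimal-state count N(t)) is our own gloss for illustration, not the paper's actual formalism. Model a bounded agent as one that selects actions from an internal state s_t ∈ S, updated online by s_{t+1} = u(s_t, a_t, o_{t+1}) as actions a_t are taken and observations o_{t+1} arrive. Write λ_{≥t} for the agent's behavior from time t onward, and let

    N(t) = min { |S'| : some agent with state space S' realizes λ_{≥t} }.

Under this gloss, the first view reads: the agent has converged by time t iff N(t') ≥ N(t) for every t' ≥ t, i.e., the shortest state-based description of its future behavior can no longer shrink. The second view reads: the agent has converged by time t iff, for all t', t'' ≥ t, s_{t'} = s_{t''} implies its expected future performance at t' and t'' is the same, i.e., performance varies only through the internal state.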
