An object-oriented representation for efficient reinforcement learning

Rich representations in reinforcement learning have been studied for the purpose of enabling generalization and making learning feasible in large state spaces. We introduce Object-Oriented MDPs (OO-MDPs), a representation based on objects and their interactions, which is a natural way of modeling environments and offers important generalization opportunities. We present a learning algorithm for deterministic OO-MDPs and prove a polynomial bound on its sample complexity. We illustrate the performance gains of our representation and algorithm in the well-known Taxi domain and in a real-life videogame.
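The generalization opportunity the abstract refers to comes from stating transition dynamics over relations between objects rather than over enumerated states. Below is a minimal sketch of that idea for the Taxi domain's North action; the names (Taxi, Wall, touch_n, step_north) are illustrative assumptions, not the paper's exact formalism, which defines object classes, attributes, and relations more generally.

```python
# Hypothetical OO-MDP-style sketch of the Taxi domain's North action.
# All names here are illustrative; the paper's formalism is more general.
from dataclasses import dataclass

@dataclass(frozen=True)
class Taxi:
    x: int
    y: int

@dataclass(frozen=True)
class Wall:
    # A wall segment blocking movement north out of cell (x, y).
    x: int
    y: int

def touch_n(taxi: Taxi, walls: frozenset) -> bool:
    """Boolean relation: is the taxi touching a wall to its north?"""
    return Wall(taxi.x, taxi.y) in walls

def step_north(taxi: Taxi, walls: frozenset) -> Taxi:
    """Deterministic dynamics of North as a condition-effect rule:
    if touch_n holds, the state is unchanged; otherwise the taxi's
    y attribute increments by one."""
    if touch_n(taxi, walls):
        return taxi
    return Taxi(taxi.x, taxi.y + 1)

# The same rule applies unchanged on any grid layout, because it is
# stated over an object relation rather than over enumerated cells.
walls = frozenset({Wall(0, 0)})
assert step_north(Taxi(0, 0), walls) == Taxi(0, 0)  # blocked by wall
assert step_north(Taxi(1, 0), walls) == Taxi(1, 1)  # moves north
```

A learner for deterministic OO-MDPs can then recover such condition-effect rules from experience, one per action and effect type, which is the kind of structure that makes a polynomial sample-complexity bound possible.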
