Characterizing reinforcement learning methods through parameterized learning problems

The field of reinforcement learning (RL) has been energized in the past few decades by elegant theoretical results indicating under what conditions, and how quickly, certain algorithms are guaranteed to converge to optimal policies. However, in practical problems, these conditions are seldom met. When we cannot achieve optimality, the performance of RL algorithms must be measured empirically. Consequently, in order to meaningfully differentiate learning methods, it becomes necessary to characterize their performance on different problems, taking into account factors such as state estimation, exploration, function approximation, and constraints on computation and memory. To this end, we propose parameterized learning problems, in which such factors can be controlled systematically and their effects on learning methods characterized through targeted studies. Apart from providing very precise control of the parameters that affect learning, our parameterized learning problems enable benchmarking against optimal behavior; their relatively small sizes facilitate extensive experimentation. Based on a survey of existing RL applications, in this article we focus our attention on two predominant, "first-order" factors: partial observability and function approximation. We design an appropriate parameterized learning problem, through which we compare two qualitatively distinct classes of algorithms: on-line value function-based methods and policy search methods. Empirical comparisons among various methods within each of these classes identify Sarsa(λ) and Q-learning(λ) as winners among the former, and CMA-ES as the winner among the latter. Comparing Sarsa(λ) and CMA-ES further on relevant problem instances, our study highlights regions of the problem space that favor their contrasting approaches. Short run-times for our experiments allow for an extensive search procedure that provides additional insights into the relationships between method-specific parameters (such as eligibility traces, initial weights, and population sizes) and problem instances.
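
For readers unfamiliar with the value function-based class of methods referenced above, the following is a minimal sketch of Sarsa(λ) with linear function approximation and replacing eligibility traces. It is a generic illustration, not the implementation used in the study; the environment interface (reset/step), the feature function, and the parameter defaults are assumptions introduced here for concreteness.

```python
import numpy as np

def sarsa_lambda(env, features, n_features, n_actions,
                 alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1,
                 episodes=500, rng=None):
    """Sarsa(lambda) with linear function approximation and replacing traces.

    Assumptions (not from the paper): `env.reset()` returns a state,
    `env.step(a)` returns (next_state, reward, done), and
    `features(state, action)` returns a binary feature vector of
    length `n_features` (e.g., from tile coding).
    """
    rng = rng or np.random.default_rng(0)
    w = np.zeros(n_features)                      # linear weight vector

    def q(s, a):
        return w @ features(s, a)

    def epsilon_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(episodes):
        e = np.zeros(n_features)                  # eligibility traces
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            phi = features(s, a)
            delta = r - q(s, a)                   # TD error, terminal case
            if not done:
                a2 = epsilon_greedy(s2)
                delta += gamma * q(s2, a2)        # bootstrap from next pair
            e = np.maximum(e, phi)                # replacing traces (binary features)
            w += alpha * delta * e                # update weights along traces
            e *= gamma * lam                      # decay traces
            if not done:
                s, a = s2, a2
    return w
```

The eligibility-trace parameter λ and the initial weights (here zero) are exactly the kind of method-specific parameters whose interaction with problem instances the search procedure described above examines.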
