Reinforcement Learning Algorithms for MDPs

Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective. The goal in reinforcement learning is to develop efficient learning algorithms, and to understand their merits and limitations. In this article we focus on a few selected reinforcement learning algorithms that build on the powerful theory of dynamic programming.

Keywords: reinforcement learning; Markov Decision Processes; temporal difference learning; stochastic approximation; function approximation; least-squares methods; Q-learning; actor-critic methods; policy gradient; natural gradient
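As a concrete illustration of the kind of algorithm discussed here, the sketch below shows tabular Q-learning, one of the methods named in the keywords: a stochastic-approximation form of the Bellman optimality backup, which is where the connection to dynamic programming enters. The environment interface (reset(), step(), and a finite action list env.actions, in the style of common RL toolkits) and all hyperparameter values are assumptions made for the sake of the example, not part of the original article.

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: learn Q(s, a) from sampled transitions.

    `env` is assumed (hypothetically) to expose reset() -> state and
    step(action) -> (next_state, reward, done), with a finite action
    set `env.actions`.
    """
    Q = defaultdict(float)  # Q[(state, action)], initialized to zero

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy exploration: mostly exploit, sometimes explore.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # Off-policy TD target: bootstrap from the best next action,
            # i.e. a sampled version of the Bellman optimality operator.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            td_error = reward + gamma * best_next - Q[(state, action)]
            Q[(state, action)] += alpha * td_error

            state = next_state
    return Q
```

Because the update bootstraps from max over next actions rather than the action the behavior policy actually takes, the algorithm is off-policy; with appropriately decaying step sizes and sufficient exploration, the tabular iterates converge to the optimal action-value function.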
