Algorithms for Reinforcement Learning

Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective. What distinguishes reinforcement learning from supervised learning is that only partial feedback is given to the learner about the learner's predictions. Further, the predictions may have long-term effects through influencing the future state of the controlled system. Thus, time plays a special role. The goal in reinforcement learning is to develop efficient learning algorithms, as well as to understand the algorithms' merits and limitations. Reinforcement learning is of great interest because of the large number of practical applications that it can be used to address, ranging from artificial intelligence to operations research and control engineering. In this book, we focus on those algorithms of reinforcement learning that build on the powerful theory of dynamic programming. We give a fairly comprehensive catalog of learning problems, describe the core ideas, survey a large number of state-of-the-art algorithms, and discuss their theoretical properties and limitations.
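As a concrete anchor for the "long-term objective" mentioned above (the notation here is a standard formalization, not taken verbatim from the abstract): in the discounted Markov decision process setting, a policy is evaluated by its expected discounted return, and the dynamic-programming theory the book builds on rests on the Bellman optimality equation:

```latex
% Expected discounted return of policy \pi started from state s:
V^{\pi}(s) = \mathbb{E}\Big[\, \sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \;\Big|\; S_0 = s \Big],
\qquad 0 \le \gamma < 1,

% Bellman optimality equation, the fixed-point characterization
% underlying value iteration and related algorithms:
V^{*}(s) = \max_{a} \Big( r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \Big).
```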

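To make the dynamic-programming connection concrete, here is a minimal sketch of tabular Q-learning, one of the best-known algorithms in this family. The toy chain environment, step size, exploration rate, and update budget below are illustrative assumptions for this sketch, not taken from the book.

```python
import numpy as np

# A minimal sketch of tabular Q-learning on a toy chain MDP.
# The environment and all hyperparameters are illustrative assumptions.

n_states, n_actions = 5, 2          # states 0..4; action 0 = left, 1 = right
gamma, alpha, epsilon = 0.95, 0.1, 0.1

def step(s, a):
    """Toy dynamics: moving right from the last state yields reward 1 and resets."""
    if a == 1:                       # right
        if s == n_states - 1:
            return 0, 1.0            # reward, then back to the start state
        return s + 1, 0.0
    return max(s - 1, 0), 0.0        # left

rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))
s = 0
for _ in range(10_000):
    # epsilon-greedy action selection
    a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
    s_next, r = step(s, a)
    # Q-learning update: bootstrap from the greedy value of the next state
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(Q)  # the learned values should favor action 1 (right) in every state
```

Q-learning can be read as a stochastic-approximation counterpart of value iteration: each update nudges Q(s, a) toward a sampled version of the Bellman optimality backup shown above.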

[101]  Benjamin Van Roy,et al.  On Constraint Sampling in the Linear Programming Approach to Approximate Dynamic Programming , 2004, Math. Oper. Res..

[102]  Long Ji Lin,et al.  Self-improving reactive agents based on reinforcement learning, planning and teaching , 1992, Machine Learning.

[103]  V. B. Tadic,et al.  On the almost sure rate of convergence of linear stochastic approximation algorithms , 2004, IEEE Transactions on Information Theory.

[104]  John N. Tsitsiklis,et al.  Feature-based methods for large scale dynamic programming , 2004, Machine Learning.

[105]  Justin A. Boyan,et al.  Technical Update: Least-Squares Temporal Difference Learning , 2002, Machine Learning.

[106]  Michael Kearns,et al.  Near-Optimal Reinforcement Learning in Polynomial Time , 1998, Machine Learning.

[107]  A. Barto,et al.  Improved Temporal Difference Methods with Linear Function Approximation , 2004 .

[108]  William D. Smart,et al.  Interpolation-based Q-learning , 2004, ICML.

[109]  Amos Storkey,et al.  Advances in Neural Information Processing Systems 20 , 2007 .

[110]  Marcus Hutter Simulation Algorithms for Computational Systems Biology , 2017, Texts in Theoretical Computer Science. An EATCS Series.

[111]  Christopher K. I. Williams,et al.  Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning) , 2005 .

[112]  Martin A. Riedmiller Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method , 2005, ECML.

[113]  Pierre Geurts,et al.  Tree-Based Batch Mode Reinforcement Learning , 2005, J. Mach. Learn. Res..

[114]  Michael L. Littman,et al.  A theoretical analysis of Model-Based Interval Estimation , 2005, ICML.

[115]  Shie Mannor,et al.  Basis Function Adaptation in Temporal Difference Reinforcement Learning , 2005, Ann. Oper. Res..

[116]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[117]  Richard S. Sutton,et al.  Learning to predict by the methods of temporal differences , 1988, Machine Learning.

[118]  Shie Mannor,et al.  Reinforcement learning with Gaussian processes , 2005, ICML.

[119]  Stefan Schaal,et al.  Natural Actor-Critic , 2003, Neurocomputing.

[120]  Pieter Abbeel,et al.  An Application of Reinforcement Learning to Aerobatic Helicopter Flight , 2006, NIPS.

[121]  Lihong Li,et al.  PAC model-free reinforcement learning , 2006, ICML.

[122]  Prasad Tadepalli,et al.  Scaling Model-Based Average-Reward Reinforcement Learning for Product Delivery , 2006, ECML.

[123]  Andrew G. Barto,et al.  An intrinsic reward mechanism for efficient exploration , 2006, ICML.

[124]  Liming Xiang,et al.  Kernel-Based Reinforcement Learning , 2006, ICIC.

[125]  Jesse Hoey,et al.  An analytic solution to discrete Bayesian reinforcement learning , 2006, ICML.

[126]  Shie Mannor,et al.  Automatic basis function construction for approximate dynamic programming and reinforcement learning , 2006, ICML.

[127]  Benjamin Van Roy,et al.  A Cost-Shaping Linear Program for Average-Cost Approximate Dynamic Programming with Performance Guarantees , 2006, Math. Oper. Res..

[128]  Alborz Geramifard,et al.  iLSTD: Eligibility Traces and Convergence Analysis , 2006, NIPS.

[129]  Johannes Fürnkranz,et al.  Proceedings of the 17th European conference on Machine Learning , 2006 .

[130]  R. Sutton Gain Adaptation Beats Least Squares , 2006 .

[131]  Warren B. Powell,et al.  Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming , 2006, Machine Learning.

[132]  Benjamin Van Roy Performance Loss Bounds for Approximate Value Iteration with State Aggregation , 2006, Math. Oper. Res..

[133]  P. Glynn,et al.  Opportunities and challenges in using online preference data for vehicle pricing: A case study at General Motors , 2006 .

[134]  Csaba Szepesvári,et al.  Bandit Based Monte-Carlo Planning , 2006, ECML.

[135]  Xi-Ren Cao,et al.  Stochastic learning and optimization - A sensitivity-based approach , 2007, Annu. Rev. Control..

[136]  Xin Xu,et al.  Kernel-Based Least Squares Policy Iteration for Reinforcement Learning , 2007, IEEE Transactions on Neural Networks.

[137]  Csaba Szepesvári,et al.  Fitted Q-iteration in continuous action-space MDPs , 2007, NIPS.

[138]  Tao Wang,et al.  Stable Dual Dynamic Programming , 2007, NIPS.

[139]  Léon Bottou,et al.  The Tradeoffs of Large Scale Learning , 2007, NIPS.

[140]  Dimitri P. Bertsekas,et al.  Stochastic optimal control : the discrete time case , 2007 .

[141]  Mohammad Ghavamzadeh,et al.  Bayesian actor-critic algorithms , 2007, ICML '07.

[142]  Peter Stone,et al.  Model-Based Exploration in Continuous State Spaces , 2007, SARA.

[143]  Michael L. Littman,et al.  Online Linear Regression and Its Application to Model-Based Reinforcement Learning , 2007, NIPS.

[144]  Csaba Szepesvári,et al.  Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path , 2006, Machine Learning.

[145]  Michael C. Fu,et al.  An Asymptotically Efficient Simulation-Based Algorithm for Finite Horizon Stochastic Dynamic Programming , 2007, IEEE Transactions on Automatic Control.

[146]  H. Robbins Some aspects of the sequential design of experiments , 1952 .

[147]  Warren B. Powell,et al.  Approximate Dynamic Programming - Solving the Curses of Dimensionality , 2007 .

[148]  Richard S. Sutton,et al.  Reinforcement Learning of Local Shape in the Game of Go , 2007, IJCAI.

[149]  D. Bertsekas,et al.  Q-learning algorithms for optimal stopping based on least squares , 2007, 2007 European Control Conference (ECC).

[150]  Lihong Li,et al.  Analyzing feature generation for value-function approximation , 2007, ICML '07.

[151]  Zoubin Ghahramani,et al.  Proceedings of the 24th international conference on Machine learning , 2007, ICML 2007.

[152]  Sanjoy Dasgupta,et al.  Random projection trees and low dimensional manifolds , 2008, STOC.

[153]  Joelle Pineau,et al.  Online Planning Algorithms for POMDPs , 2008, J. Artif. Intell. Res..

[154]  Peter Auer,et al.  Near-optimal Regret Bounds for Reinforcement Learning , 2008, J. Mach. Learn. Res..

[155]  András Lörincz,et al.  The many faces of optimism: a unifying approach , 2008, ICML '08.

[156]  Richard S. Sutton,et al.  A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation , 2008, NIPS.

[157]  Lihong Li,et al.  An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning , 2008, ICML '08.

[158]  Sean P. Meyn,et al.  An analysis of reinforcement learning with function approximation , 2008, ICML '08.

[159]  Joelle Pineau,et al.  Model-Based Bayesian Reinforcement Learning in Large Structured Domains , 2008, UAI.

[160]  Csaba Szepesvári,et al.  Finite-Time Bounds for Fitted Value Iteration , 2008, J. Mach. Learn. Res..

[161]  Csaba Szepesvári,et al.  Empirical Bernstein stopping , 2008, ICML '08.

[162]  L. Sherry,et al.  Estimating Taxi-out times with a reinforcement learning algorithm , 2008, 2008 IEEE/AIAA 27th Digital Avionics Systems Conference.

[163]  M. Kosorok Introduction to Empirical Processes and Semiparametric Inference , 2008 .

[164]  Dimitri P. Bertsekas,et al.  New error bounds for approximations from projected linear equations , 2008, 2008 46th Annual Allerton Conference on Communication, Control, and Computing.

[165]  William W. Cohen,et al.  Proceedings of the 23rd international conference on Machine learning , 2006, ICML 2008.

[166]  V. Borkar Stochastic Approximation: A Dynamical Systems Viewpoint , 2008, Texts and Readings in Mathematics.

[167]  Shie Mannor,et al.  Reinforcement learning in the presence of rare events , 2008, ICML '08.

[168]  Marc Toussaint,et al.  Hierarchical POMDP Controller Optimization by Likelihood Maximization , 2008, UAI.

[169]  Michael L. Littman,et al.  Multi-resolution Exploration in Continuous Spaces , 2008, NIPS.

[170]  Alborz Geramifard,et al.  Dyna-Style Planning with Linear Function Approximation and Prioritized Sweeping , 2008, UAI.

[171]  Shie Mannor,et al.  Regularized Policy Iteration , 2008, NIPS.

[172]  Panos M. Pardalos,et al.  Approximate dynamic programming: solving the curses of dimensionality , 2009, Optim. Methods Softw..

[173]  Shalabh Bhatnagar,et al.  Natural actor-critic algorithms , 2009, Autom..

[174]  Gavin Taylor,et al.  Kernelized value function approximation for reinforcement learning , 2009, ICML '09.

[175]  Alexandre B. Tsybakov,et al.  Introduction to Nonparametric Estimation , 2008, Springer series in statistics.

[176]  Warren B. Powell,et al.  An Optimal Approximate Dynamic Programming Algorithm for the Lagged Asset Acquisition Problem , 2009, Math. Oper. Res..

[177]  Sridhar Mahadevan,et al.  Learning Representation and Control in Markov Decision Processes: New Frontiers , 2009, Found. Trends Mach. Learn..

[178]  Andrew Y. Ng,et al.  Regularization and feature selection in least-squares temporal difference learning , 2009, ICML '09.

[179]  Shie Mannor,et al.  Regularized Fitted Q-iteration: Application to Planning , 2008, EWRL.

[180]  Dale Schuurmans,et al.  Learning Exercise Policies for American Options , 2009, AISTATS.

[181]  Ambuj Tewari,et al.  REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs , 2009, UAI.

[182]  Brian Tanner,et al.  RL-Glue: Language-Independent Software for Reinforcement-Learning Experiments , 2009, J. Mach. Learn. Res..

[183]  Csaba Szepesvári,et al.  Exploration-exploitation tradeoff using variance estimates in multi-armed bandits , 2009, Theor. Comput. Sci..

[184]  Shie Mannor,et al.  Markov Decision Processes with Arbitrary Reward Processes , 2009, Math. Oper. Res..

[185]  Shalabh Bhatnagar,et al.  Fast gradient-descent methods for temporal-difference learning with linear function approximation , 2009, ICML '09.

[186]  Warren B. Powell,et al.  An Approximate Dynamic Programming Algorithm for Large-Scale Fleet Management: A Case Application , 2009, Transp. Sci..

[187]  C. Lemieux Monte Carlo and Quasi-Monte Carlo Sampling , 2009 .

[188]  Shalabh Bhatnagar,et al.  Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation , 2009, NIPS.

[189]  Bruno Scherrer,et al.  Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view , 2010, ICML.

[190]  Shalabh Bhatnagar,et al.  Toward Off-Policy Learning Control with Function Approximation , 2010, ICML.

[191]  Csaba Szepesvári,et al.  Model-based reinforcement learning with nearly tight exploration complexity bounds , 2010, ICML.

[192]  Richard S. Sutton,et al.  GQ(lambda): A general gradient algorithm for temporal-difference prediction learning with eligibility traces , 2010, Artificial General Intelligence.

[193]  Ronald Ortner,et al.  Online Regret Bounds for Markov Decision Processes with Deterministic Transitions , 2008, ALT.

[194]  Csaba Szepesvari,et al.  The Online Loop-free Stochastic Shortest-Path Problem , 2010, Annual Conference Computational Learning Theory.

[195]  Bart De Schutter,et al.  Reinforcement Learning and Dynamic Programming Using Function Approximators , 2010 .

[196]  R. Sutton,et al.  GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces , 2010 .

[197]  Peter A. Flach,et al.  Proceedings of the 28th International Conference on Machine Learning , 2011 .

[198]  Kevin D. Glazebrook,et al.  Multi-Armed Bandit Allocation Indices: Gittins/Multi-Armed Bandit Allocation Indices , 2011 .

[199]  Ferenc Beleznay,et al.  Comparing Value-Function Estimation Algorithms in Undiscounted Problems , 2012 .

[200]  T. L. Lai Andherbertrobbins Asymptotically Efficient Adaptive Allocation Rules , 2022 .