Policy evaluation with temporal differences: a survey and comparison

Extended abstract of the article: Christoph Dann, Gerhard Neumann, Jan Peters (2014). Policy Evaluation with Temporal Differences: A Survey and Comparison. Journal of Machine Learning Research, 15, 809-883.

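As a concrete illustration of the family of methods the article surveys, below is a minimal sketch of tabular TD(0) policy evaluation on a small random-walk problem. The environment, policy, step size, and names used here (td0_policy_evaluation, n_states, alpha, gamma) are illustrative assumptions for this sketch, not notation or code from the article.

# Minimal sketch of tabular TD(0) policy evaluation (illustrative only).
# The random-walk MDP, uniform random policy, and step size below are
# assumptions for demonstration; they are not taken from the article.
import numpy as np

def td0_policy_evaluation(n_states=5, episodes=500, alpha=0.1, gamma=1.0, seed=0):
    """Estimate the state values of a fixed uniform-random policy on a
    1-D random walk with terminal states at both ends."""
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states + 2)            # indices 0 and n_states+1 are terminal
    for _ in range(episodes):
        s = (n_states + 1) // 2           # start in the middle state
        while 0 < s < n_states + 1:
            s_next = s + rng.choice([-1, 1])             # random-walk step
            r = 1.0 if s_next == n_states + 1 else 0.0   # reward only at right terminal
            # TD(0) update: move V(s) toward the bootstrapped target r + gamma * V(s')
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V[1:-1]                        # values of the non-terminal states

if __name__ == "__main__":
    print(td0_policy_evaluation())

With these settings the estimates approach the known values 1/6, 2/6, ..., 5/6 for the five non-terminal states of this classic example.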