An online prediction algorithm for reinforcement learning with linear function approximation using cross entropy method

In this paper, we propose two new stable online algorithms for the prediction problem in reinforcement learning, i.e., estimating the value function of a model-free Markov reward process using a linear function approximation architecture, with memory and computation costs that scale quadratically in the size of the feature set. The algorithms employ a multi-timescale stochastic approximation variant of the well-known cross-entropy optimization method, a model-based search technique for finding the global optimum of a real-valued function. We prove convergence of the algorithms using the ODE method and supplement the theoretical results with experimental comparisons. The algorithms perform well fairly consistently across many RL benchmark problems in terms of computational efficiency, accuracy, and stability.
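To make the idea concrete, below is a minimal sketch (not the paper's exact algorithm) of how a cross-entropy search over the weights of a linear value-function approximator might look for a small Markov reward process. The toy chain, the features `phi`, the empirical mean-squared TD error objective, and all hyperparameters (`n_samples`, `elite_frac`, `smoothing`) are illustrative assumptions; the quadratic memory cost shows up in the full covariance matrix of the Gaussian sampling distribution over the weight vector.

```python
# Hypothetical sketch: cross-entropy (CE) search over linear value-function
# weights for a toy Markov reward process. Not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)

# Toy 5-state Markov reward process with uniform transition dynamics.
n_states, gamma = 5, 0.95
P = np.full((n_states, n_states), 1.0 / n_states)
r = rng.normal(size=n_states)

def phi(s):
    # Simple polynomial features of the (normalized) state index.
    x = s / (n_states - 1)
    return np.array([1.0, x, x * x])

k = 3  # feature dimension

def sample_transitions(m=500):
    s = rng.integers(n_states, size=m)
    s_next = np.array([rng.choice(n_states, p=P[i]) for i in s])
    return s, r[s], s_next

def neg_mstde(w, batch):
    """Negative empirical mean-squared TD error of weights w on a batch."""
    s, rew, s_next = batch
    v = np.array([phi(i) @ w for i in s])
    v_next = np.array([phi(i) @ w for i in s_next])
    delta = rew + gamma * v_next - v
    return -np.mean(delta ** 2)

# CE search: maintain a Gaussian over w (mean + full covariance, hence
# O(k^2) memory), sample candidates, keep the elite fraction, refit.
mu, Sigma = np.zeros(k), np.eye(k)
n_samples, elite_frac, smoothing = 100, 0.1, 0.7
for it in range(50):
    batch = sample_transitions()
    W = rng.multivariate_normal(mu, Sigma, size=n_samples)
    scores = np.array([neg_mstde(w, batch) for w in W])
    elite = W[np.argsort(scores)[-int(elite_frac * n_samples):]]
    mu = smoothing * elite.mean(axis=0) + (1 - smoothing) * mu
    Sigma = (smoothing * np.cov(elite, rowvar=False)
             + (1 - smoothing) * Sigma + 1e-6 * np.eye(k))

print("estimated weights:", mu)
```

The online algorithms in the paper replace the batch elite-refitting step with stochastic-approximation updates of the sampling distribution's parameters on two timescales; the sketch above only illustrates the underlying model-based search idea.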
