An analysis of reinforcement learning with function approximation

We address the problem of computing the optimal Q-function in Markov decision problems with an infinite state space. We analyze the convergence properties of several variations of Q-learning combined with function approximation, extending the analysis of TD-learning in (Tsitsiklis & Van Roy, 1996a) to stochastic control settings. We identify conditions under which such approximate methods converge with probability 1. We conclude with a brief discussion of the general applicability of our results and a comparison with several related works.
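For concreteness, below is a minimal sketch of Q-learning with linear function approximation, the class of methods whose convergence is analyzed here. The environment interface env_step, the feature map phi, the behavior policy, and the step-size schedule are illustrative assumptions, not constructions taken from the paper.

import numpy as np

def linear_q_learning(env_step, phi, num_actions, num_features,
                      gamma=0.95, num_steps=10_000, seed=0):
    # Approximate Q(x, a) by the inner product phi(x, a) . theta and
    # update theta along sampled temporal-difference errors.
    rng = np.random.default_rng(seed)
    theta = np.zeros(num_features)
    x = 0.0  # assumed initial state of the (possibly infinite) state space
    for t in range(1, num_steps + 1):
        a = int(rng.integers(num_actions))      # exploratory behavior policy (assumed)
        r, x_next = env_step(x, a)              # sample one transition of the controlled chain
        q_next = max(phi(x_next, b) @ theta for b in range(num_actions))
        td_error = r + gamma * q_next - phi(x, a) @ theta
        alpha = 1.0 / t                         # diminishing step sizes
        theta = theta + alpha * td_error * phi(x, a)
        x = x_next
    return theta

Whether iterations of this kind converge depends on the interplay between the features, the sampling policy, and the step sizes; the conditions identified in the paper are of this nature.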

[1] Pierre Priouret, et al. Adaptive Algorithms and Stochastic Approximations, 1990, Applications of Mathematics.

[2] Richard L. Tweedie, et al. Markov Chains and Stochastic Stability, 1993, Communications and Control Engineering Series.

[3] S. Meyn, et al. Computable Bounds for Geometric Convergence Rates of Markov Chains, 1994.

[4] Michael I. Jordan, et al. Technical report, MIT Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Department of Brain and Cognitive Sciences, 1996.

[5] Michael I. Jordan, et al. Reinforcement Learning with Soft State Aggregation, 1994, NIPS.

[6] Geoffrey J. Gordon. Stable Function Approximation in Dynamic Programming, 1995, ICML.

[7] Ben J. A. Kröse, et al. Learning from delayed rewards, 1995, Robotics Auton. Syst.

[8] Leemon C. Baird, et al. Residual Algorithms: Reinforcement Learning with Function Approximation, 1995, ICML.

[9] P. Diaconis, et al. Logarithmic Sobolev Inequalities for Finite Markov Chains, 1996.

[10] John N. Tsitsiklis, et al. Neuro-Dynamic Programming, 1996, Encyclopedia of Machine Learning.

[11] John N. Tsitsiklis, et al. Analysis of Temporal-Difference Learning with Function Approximation, 1996, NIPS.

[12] V. Borkar. Stochastic approximation with two time scales, 1997.

[13] Richard S. Sutton, et al. Open Theoretical Questions in Reinforcement Learning, 1999, EuroCOLT.

[14] V. Borkar. A Learning Algorithm for Discrete-Time Stochastic Control, 2000, Probability in the Engineering and Informational Sciences.

[15] Benjamin Van Roy, et al. On the existence of fixed points for approximate value iteration and temporal-difference learning, 2000.

[16] Michael I. Jordan, et al. On the Convergence of Temporal-Difference Learning with Linear Function Approximation, 2001.

[17] Sanjoy Dasgupta, et al. Off-Policy Temporal Difference Learning with Function Approximation, 2001, ICML.

[18] J. Rosenthal. Quantitative Convergence Rates of Markov Chains: A Simple Account, 2002.

[19] Doina Precup, et al. A Convergent Form of Approximate Policy Iteration, 2002, NIPS.

[20] Theodore J. Perkins, et al. On the Existence of Fixed Points for Q-Learning and Sarsa in Partially Observable Domains, 2002, ICML.

[21] Tommi S. Jaakkola, et al. Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms, 2000, Machine Learning.

[22] Vladislav Tadic, et al. On the Convergence of Temporal-Difference Learning with Linear Function Approximation, 2001, Machine Learning.

[23] John N. Tsitsiklis, et al. Feature-based methods for large scale dynamic programming, 2004, Machine Learning.

[24] William D. Smart, et al. Interpolation-based Q-learning, 2004, ICML.

[25] Liming Xiang, et al. Kernel-Based Reinforcement Learning, 2006, ICIC.