Paper Information - Action Elimination and Stopping Conditions for Reinforcement Learning

Action Elimination and Stopping Conditions for Reinforcement Learning

We consider incorporating action elimination procedures into reinforcement learning algorithms. We suggest a framework based on learning upper and lower estimates of the value function or the Q-function and eliminating actions that are not optimal. We provide model-based and model-free variants of the elimination method. We further derive stopping conditions that guarantee that the learned policy is approximately optimal with high probability. Simulations demonstrate considerable speedup and added robustness.
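To make the idea in the abstract concrete, the following is a minimal Python sketch, not the paper's algorithm. It assumes that upper and lower estimates of the Q-function for a single state are already available; how they are learned (model-based or model-free) is left out. The function names `eliminate_actions` and `should_stop`, the action labels, and the tolerance `tol` are all hypothetical and chosen only for illustration. An action is dropped when its upper estimate falls below the best lower estimate among the remaining actions, and the stopping check simply asks whether every remaining gap between upper and lower estimates is within `tol`; the exact threshold that yields an approximate-optimality guarantee is derived in the paper and is not reproduced here.

```python
from typing import Dict, Set


def eliminate_actions(
    q_upper: Dict[str, float],
    q_lower: Dict[str, float],
    active: Set[str],
) -> Set[str]:
    """Return the active actions that cannot yet be ruled out.

    An action is ruled out when its upper estimate is strictly below the
    largest lower estimate among the currently active actions.
    """
    best_lower = max(q_lower[a] for a in active)
    return {a for a in active if q_upper[a] >= best_lower}


def should_stop(
    q_upper: Dict[str, float],
    q_lower: Dict[str, float],
    active: Set[str],
    tol: float,
) -> bool:
    """Illustrative stopping check: all remaining gaps are at most tol.

    `tol` is a hypothetical parameter; it is not the paper's threshold.
    """
    return all(q_upper[a] - q_lower[a] <= tol for a in active)


if __name__ == "__main__":
    # Made-up estimates for one state with three actions.
    q_upper = {"left": 0.9, "right": 0.4, "stay": 0.7}
    q_lower = {"left": 0.6, "right": 0.1, "stay": 0.3}

    survivors = eliminate_actions(q_upper, q_lower, set(q_upper))
    print(sorted(survivors))  # ['left', 'stay']
    print(should_stop(q_upper, q_lower, survivors, tol=0.5))  # True
```

In this made-up example, "right" is eliminated because its upper estimate (0.4) is below the lower estimate of "left" (0.6), while "stay" survives because its upper estimate (0.7) still exceeds that bound.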

[1] W. Hoeffding. Probability Inequalities for Sums of Bounded Random Variables, 1963.

[2] J. MacQueen. A Modified Dynamic Programming Method for Markovian Decision Problems, 1966.

[3] C. Watkins. Learning from Delayed Rewards, 1989.

[4] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[5] Andrew W. Moore, et al. Reinforcement Learning: A Survey, 1996, J. Artif. Intell. Res.

[6] Anders R. Kristensen, et al. Dynamic Programming and Markov Decision Processes, 1996.

[7] John N. Tsitsiklis, et al. Neuro-Dynamic Programming, 1996, Encyclopedia of Machine Learning.

[8] Michael Kearns, et al. Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms, 1998, NIPS.

[9] Richard S. Sutton, et al. Introduction to Reinforcement Learning, 1998.

[10] Yishay Mansour, et al. Competitive Queue Policies for Differentiated Services, 2000, Proceedings IEEE INFOCOM 2000. Conference on Computer Communications. Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies (Cat. No.00CH37064).

[11] Sean P. Meyn, et al. The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning, 2000, SIAM J. Control. Optim.

[12] Boaz Patt-Shamir, et al. Buffer Overflow Management in QoS Switches, 2001, STOC '01.

[13] Yishay Mansour, et al. Learning Rates for Q-learning, 2004, J. Mach. Learn. Res.

[14] John Langford, et al. Approximately Optimal Approximate Reinforcement Learning, 2002, ICML.

[15] Satinder Singh, et al. An Upper Bound on the Loss from Approximate Optimal-Value Functions, 1994, Machine Learning.

[16] Richard S. Sutton, et al. Reinforcement Learning, 1992, Handbook of Machine Learning.