Direct Policy Search using Paired Statistical Tests

Direct policy search is a practical way to solve reinforcement learning problems involving continuous state and action spaces. The goal becomes finding policy parameters that maximize a noisy objective function. The Pegasus method converts this stochastic optimization problem into a deterministic one by using fixed start states and fixed random number sequences for comparing policies (Ng & Jordan, 2000). We evaluate Pegasus and other paired comparison methods on the mountain car problem and on a difficult pursuer-evader problem. We conclude that: (i) paired tests can improve the performance of deterministic and stochastic optimization procedures; (ii) our proposed alternatives to Pegasus can generalize better, by using a different test statistic or by changing the scenarios during learning; and (iii) adapting the number of trials used for each policy comparison yields fast and robust learning.

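To make the abstract's two key mechanisms concrete, here is a minimal, self-contained Python sketch; it is not the authors' code. The toy objective inside run_episode, the function names, and all parameter values are hypothetical stand-ins, and the paired t-test is just one plausible choice of test statistic. Fixing the random seed per scenario gives both policies identical start states and "random" transitions (the Pegasus idea), so their returns form a paired sample; trials are then added until the test reaches significance or a budget is exhausted, mirroring conclusion (iii).

```python
"""Sketch of Pegasus-style paired policy comparison (hypothetical toy example)."""
import numpy as np
from scipy import stats


def run_episode(policy_params, seed):
    """Toy stand-in for a simulator rollout; returns a scalar return.

    Fixing `seed` fixes the start state and all stochastic transitions,
    so re-running the same scenario with a different policy is a
    deterministic, paired evaluation.
    """
    rng = np.random.default_rng(seed)
    start = rng.uniform(-1.0, 1.0)    # fixed start state for this scenario
    noise = rng.normal(scale=0.5)     # fixed "random" transition noise
    return -(start - policy_params) ** 2 + noise


def paired_compare(theta_a, theta_b, n_min=5, n_max=50, alpha=0.05):
    """Compare two policies on shared scenarios, adding trials until the
    paired t-test is significant or the trial budget runs out."""
    diffs = []
    for seed in range(n_max):
        # Same seed for both policies: simulation noise cancels in the difference.
        diffs.append(run_episode(theta_a, seed) - run_episode(theta_b, seed))
        if len(diffs) >= n_min:
            t, p = stats.ttest_1samp(diffs, 0.0)
            if p < alpha:             # confident decision: stop adding trials
                return (1 if t > 0 else -1), len(diffs)
    return 0, len(diffs)              # no significant difference within budget


if __name__ == "__main__":
    winner, trials = paired_compare(theta_a=0.1, theta_b=0.8)
    print(f"winner: {winner:+d} after {trials} paired trials")
```

Because each scenario's simulation noise is shared by both policies, it cancels in the paired differences; in this toy example the comparison usually resolves within a modest number of trials, whereas comparing independently sampled returns would need far more.
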
[1] J. A. Nelder and R. Mead. A Simplex Method for Function Minimization. The Computer Journal, 1965.

[2] B. J. A. Kröse et al. Learning from delayed rewards. Robotics and Autonomous Systems, 1995.

[3] A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to Act Using Real-Time Dynamic Programming. Artificial Intelligence, 1995.

[4] M. E. Harmon et al. Multi-Agent Residual Advantage Learning with General Function Approximation. 1996.

[5] M. H. Wright. Direct search methods: Once scorned, now respectable. 1996.

[6] R. Storn and K. Price. Differential Evolution – A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces. Journal of Global Optimization, 1997.

[7] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[8] M. J. D. Powell. Direct search algorithms for optimization calculations. Acta Numerica, 1998.

[9] R. Munos and A. W. Moore. Variable Resolution Discretization for High-Accuracy Solutions of Optimal Control Problems. IJCAI, 1999.

[10] L. C. Baird. Reinforcement Learning Through Gradient Descent. 1999.

[11] R. Dearden, N. Friedman, and D. Andre. Model based Bayesian Exploration. UAI, 1999.

[12] C. T. Kelley. Detection and Remediation of Stagnation in the Nelder–Mead Algorithm Using a Sufficient Decrease Condition. SIAM Journal on Optimization, 1999.

[13] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS, 1999.

[14] J. Baxter and P. L. Bartlett. Direct Gradient-Based Reinforcement Learning: I. Gradient Estimation Algorithms. 1999.

[15] A. Y. Ng, R. Parr, and D. Koller. Policy Search via Density Estimation. NIPS, 1999.

[16] M. J. A. Strens. A Bayesian Framework for Reinforcement Learning. ICML, 2000.

[17] A. Y. Ng and M. I. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. UAI, 2000.

[18] J. Baxter et al. Direct gradient-based reinforcement learning. IEEE International Symposium on Circuits and Systems, 2000.