Direct Policy Search using Paired Statistical Tests

Direct policy search is a practical way to solve reinforcement learning problems involving continuous state and action spaces. The goal becomes finding policy parameters that maximize a noisy objective function. The Pegasus method converts this stochastic optimization problem into a deterministic one by using fixed start states and fixed random number sequences for comparing policies (Ng & Jordan, 2000). We evaluate Pegasus and other paired comparison methods on the mountain car problem and on a difficult pursuer-evader problem. We conclude that: (i) paired tests can improve the performance of deterministic and stochastic optimization procedures; (ii) our proposed alternatives to Pegasus can generalize better, by using a different test statistic or by changing the scenarios during learning; and (iii) adapting the number of trials used for each policy comparison yields fast and robust learning.

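To make the abstract's two key mechanisms concrete, here is a minimal, self-contained Python sketch; it is not the authors' code. The toy objective inside run_episode, the function names, and all parameter values are hypothetical stand-ins, and the paired t-test is just one plausible choice of test statistic. Fixing the random seed per scenario gives both policies identical start states and "random" transitions (the Pegasus idea), so their returns form a paired sample; trials are then added until the test reaches significance or a budget is exhausted, mirroring conclusion (iii).

```python
"""Sketch of Pegasus-style paired policy comparison (hypothetical toy example)."""
import numpy as np
from scipy import stats


def run_episode(policy_params, seed):
    """Toy stand-in for a simulator rollout; returns a scalar return.

    Fixing `seed` fixes the start state and all stochastic transitions,
    so re-running the same scenario with a different policy is a
    deterministic, paired evaluation.
    """
    rng = np.random.default_rng(seed)
    start = rng.uniform(-1.0, 1.0)    # fixed start state for this scenario
    noise = rng.normal(scale=0.5)     # fixed "random" transition noise
    return -(start - policy_params) ** 2 + noise


def paired_compare(theta_a, theta_b, n_min=5, n_max=50, alpha=0.05):
    """Compare two policies on shared scenarios, adding trials until the
    paired t-test is significant or the trial budget runs out."""
    diffs = []
    for seed in range(n_max):
        # Same seed for both policies: simulation noise cancels in the difference.
        diffs.append(run_episode(theta_a, seed) - run_episode(theta_b, seed))
        if len(diffs) >= n_min:
            t, p = stats.ttest_1samp(diffs, 0.0)
            if p < alpha:             # confident decision: stop adding trials
                return (1 if t > 0 else -1), len(diffs)
    return 0, len(diffs)              # no significant difference within budget


if __name__ == "__main__":
    winner, trials = paired_compare(theta_a=0.1, theta_b=0.8)
    print(f"winner: {winner:+d} after {trials} paired trials")
```

Because each scenario's simulation noise is shared by both policies, it cancels in the paired differences; in this toy example the comparison usually resolves within a modest number of trials, whereas comparing independently sampled returns would need far more.
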
[1] J. A. Nelder and R. Mead. A Simplex Method for Function Minimization. The Computer Journal, 1965.

[2] B. J. A. Kröse et al. Learning from delayed rewards. Robotics and Autonomous Systems, 1995.

[3] A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to Act Using Real-Time Dynamic Programming. Artificial Intelligence, 1995.

[4] M. E. Harmon et al. Multi-Agent Residual Advantage Learning with General Function Approximation. 1996.

[5] M. H. Wright. Direct search methods: Once scorned, now respectable. 1996.

[6] R. Storn and K. Price. Differential Evolution – A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces. Journal of Global Optimization, 1997.

[7] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[8] M. J. D. Powell. Direct search algorithms for optimization calculations. Acta Numerica, 1998.

[9] R. Munos and A. W. Moore. Variable Resolution Discretization for High-Accuracy Solutions of Optimal Control Problems. IJCAI, 1999.

[10] L. C. Baird. Reinforcement Learning Through Gradient Descent. 1999.

[11] R. Dearden, N. Friedman, and D. Andre. Model based Bayesian Exploration. UAI, 1999.

[12] C. T. Kelley. Detection and Remediation of Stagnation in the Nelder–Mead Algorithm Using a Sufficient Decrease Condition. SIAM Journal on Optimization, 1999.

[13] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS, 1999.

[14] J. Baxter and P. L. Bartlett. Direct Gradient-Based Reinforcement Learning: I. Gradient Estimation Algorithms. 1999.

[15] A. Y. Ng, R. Parr, and D. Koller. Policy Search via Density Estimation. NIPS, 1999.

[16] M. J. A. Strens. A Bayesian Framework for Reinforcement Learning. ICML, 2000.

[17] A. Y. Ng and M. I. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. UAI, 2000.

[18] J. Baxter et al. Direct gradient-based reinforcement learning. IEEE International Symposium on Circuits and Systems, 2000.