Policy Improvement Methods: Between Black-Box Optimization and Episodic Reinforcement Learning

Policy improvement methods seek to optimize the parameters of a policy with respect to a utility function. There are two main approaches to performing this optimization: reinforcement learning (RL) and black-box optimization (BBO). Whereas BBO algorithms are generic optimization methods that, due to their generality, may also be applied to optimizing policy parameters, RL algorithms are specifically tailored to leveraging the structure of policy improvement problems. In recent years, benchmark comparisons between RL and BBO have been made, and there have been several attempts to specify which approach works best for which classes of problems. In this article, we make several contributions to this line of research: 1) We define four algorithmic properties that further clarify the relationship between RL and BBO: action-perturbation vs. parameter-perturbation, gradient estimation vs. reward-weighted averaging, use of only rewards vs. use of rewards and state information, actor-critic vs. direct policy search. 2) We show how the chronology of the derivation of ever more powerful algorithms displays a trend towards algorithms based on parameter-perturbation and reward-weighted averaging. A striking feature of this trend is that it has moved RL methods closer and closer to BBO. 3) We continue this trend by applying two modifications to the state-of-the-art "Policy Improvement with Path Integrals" algorithm (PI2), which yields an algorithm we denote PIBB. We show that PIBB is a BBO algorithm and, more specifically, that it is a special case of the "Covariance Matrix Adaptation Evolution Strategy" (CMA-ES). Our empirical evaluation demonstrates that the simpler PIBB outperforms PI2 on simple evaluation tasks in terms of convergence speed and final cost. 4) Although our evaluation implies that, for these five tasks, BBO outperforms RL, we do not hold this to be a general statement, and we provide an analysis of why these tasks are particularly well-suited for BBO. Thus, rather than making the case for BBO or RL, one of the main contributions of this article is to provide an algorithmic framework in which such cases may be made, as PIBB and PI2 use identical perturbation and parameter update methods, and differ only in being BBO and RL approaches respectively.
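To make the shared update scheme concrete, the sketch below illustrates episodic policy improvement with parameter perturbation and reward-weighted averaging, the combination the abstract identifies as the end point of the RL-to-BBO trend. It is a minimal, hedged illustration rather than the paper's exact PI2 or PIBB algorithm: the function and parameter names (cost_fn, sigma, the eliteness parameter h, the number of rollouts) are illustrative assumptions, and the cost function in the usage example is a stand-in for a real rollout.

```python
import numpy as np

def reward_weighted_averaging_search(cost_fn, theta_init, n_updates=100,
                                     n_rollouts=10, sigma=0.1, h=10.0):
    """Sketch of policy improvement via parameter perturbation and
    reward-weighted averaging (the scheme shared by PIBB-style BBO
    policy search). `cost_fn` is a hypothetical black box that runs one
    rollout with the given policy parameters and returns a scalar cost."""
    theta = np.asarray(theta_init, dtype=float)
    for _ in range(n_updates):
        # 1. Perturb the policy parameters (not the actions).
        samples = theta + sigma * np.random.randn(n_rollouts, theta.size)
        # 2. Evaluate each perturbed parameter vector with one rollout.
        costs = np.array([cost_fn(s) for s in samples])
        # 3. Map costs to weights by exponentiating the normalized negative
        #    cost, so that low-cost rollouts dominate the update.
        c_min, c_max = costs.min(), costs.max()
        weights = np.exp(-h * (costs - c_min) / (c_max - c_min + 1e-12))
        weights /= weights.sum()
        # 4. Reward-weighted averaging: no gradient is estimated; the new
        #    parameter vector is the weighted average of the samples.
        theta = weights @ samples
    return theta

if __name__ == "__main__":
    # Hypothetical usage: minimize a quadratic "cost per rollout".
    target = np.array([1.0, -2.0, 0.5])
    cost = lambda th: float(np.sum((th - target) ** 2))
    print(reward_weighted_averaging_search(cost, np.zeros(3)))
```

Because the loop uses only the scalar cost of each rollout, it treats the policy as a black box; an RL variant such as PI2 would instead exploit per-time-step rewards and state information within each rollout.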
