Derivative-free reinforcement learning: a review

Reinforcement learning aims to learn agent policies that make the best sequential decisions in unknown environments. In an unknown environment, the agent must explore while exploiting the information it has already collected, which typically yields a sophisticated optimization problem. Derivative-free optimization, meanwhile, is well suited to such sophisticated problems. It commonly follows a sampling-and-updating framework that iteratively improves the solution, in which exploration and exploitation must likewise be well balanced. Derivative-free optimization therefore addresses a core issue similar to that of reinforcement learning, and has been incorporated into reinforcement learning approaches under the names of learning classifier systems and neuroevolution/evolutionary reinforcement learning. Although such methods have been developed for decades, derivative-free reinforcement learning has recently attracted increasing attention; a recent survey of the topic, however, is still lacking. In this article, we summarize derivative-free reinforcement learning methods to date and organize them along several aspects: parameter updating, model selection, exploration, and parallel/distributed methods. We also discuss current limitations and possible future directions, hoping that this article brings more attention to the topic and serves as a catalyst for developing novel and efficient approaches.
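The sampling-and-updating framework mentioned above can be illustrated with a minimal evolution-strategy-style loop. This is a toy sketch, not any specific method from the literature: all names and parameter values here are illustrative assumptions. Exploration comes from Gaussian perturbations around the current solution; exploitation comes from moving the solution toward the better-rewarded perturbations, using only function evaluations and no gradients of the objective.

```python
import numpy as np

def sample_and_update(reward_fn, dim, iterations=200, pop_size=20,
                      sigma=0.1, lr=0.05, seed=0):
    """Minimal sampling-and-updating loop in the style of evolution strategies.

    Exploration: Gaussian perturbations of the current solution.
    Exploitation: shift the solution toward perturbations with higher reward.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)  # current solution, e.g. flattened policy parameters
    for _ in range(iterations):
        # Sample a population of random perturbation directions.
        eps = rng.standard_normal((pop_size, dim))
        # Evaluate the black-box reward at each perturbed candidate.
        rewards = np.array([reward_fn(theta + sigma * e) for e in eps])
        # Centre rewards as a simple baseline, then take a
        # stochastic-gradient-style step estimated purely from function values.
        adv = rewards - rewards.mean()
        theta = theta + lr / (pop_size * sigma) * eps.T @ adv
    return theta

# Toy black-box objective: maximize -||theta - target||^2 (no gradient access).
target = np.array([1.0, -2.0, 0.5])
best = sample_and_update(lambda th: -np.sum((th - target) ** 2), dim=3)
```

After a few hundred iterations the solution drifts close to the optimum of the toy objective, even though the loop only ever observes scalar rewards.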
