Optimizing Simulations with Noise-Tolerant Structured Exploration

We propose a simple, noise-tolerant drop-in replacement for the standard finite-difference procedure used ubiquitously in blackbox optimization. In our approach, parameter perturbation directions are defined by a family of structured orthogonal matrices. We show that at the small cost of computing a Fast Walsh-Hadamard/Fourier Transform (FWHT/FFT), such structured finite differences consistently give higher-quality approximations of gradients and Jacobians than vanilla approaches that use coordinate directions or random Gaussian perturbations. We find that trajectory optimizers like Iterative LQR and Differential Dynamic Programming require fewer iterations to solve several classic continuous control tasks when our methods are used to linearize noisy, blackbox dynamics instead of standard finite differences. By embedding structured exploration in a quasi-Newton optimizer (LBFGS), we are able to learn agile walking and turning policies for quadruped locomotion that successfully transfer from simulation to actual hardware. We theoretically justify our methods via bounds on the quality of gradient reconstruction and provide a basis for applying them to nonsmooth problems as well.
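To make the idea concrete, below is a minimal NumPy/SciPy sketch of a structured finite-difference gradient estimator in the spirit of the abstract: perturbation directions are the rows of a random structured orthogonal matrix M = (1/√d) HD, where H is a Hadamard matrix and D a random ±1 diagonal. This is an illustration under stated assumptions, not the paper's implementation: a dense Hadamard matrix stands in for the O(d log d) FWHT, the dimension is assumed to be a power of two, and the names `structured_gradient` and `noisy_objective` are hypothetical. The usage example mirrors the abstract's embedding of structured exploration in a quasi-Newton optimizer by passing the estimator to L-BFGS-B as the gradient.

```python
import numpy as np
from scipy.linalg import hadamard
from scipy.optimize import minimize

def structured_gradient(f, x, sigma=1e-2, rng=None):
    """Estimate grad f(x) by forward differences along the rows of a
    random structured orthogonal matrix M = (1/sqrt(d)) H D.

    Sketch only: a dense Hadamard matrix stands in for the FWHT, and
    the dimension d = x.size must be a power of two.
    """
    d = x.size
    rng = rng or np.random.default_rng(0)
    D = rng.choice([-1.0, 1.0], size=d)       # random +/-1 diagonal
    M = hadamard(d) * D / np.sqrt(d)          # orthonormal rows: M @ M.T = I
    fx = f(x)
    # Forward differences approximate the directional derivatives (M @ g)_i.
    dirs = np.array([(f(x + sigma * v) - fx) / sigma for v in M])
    # Orthogonality of M recovers the gradient: M.T @ (M @ g) = g.
    return M.T @ dirs

if __name__ == "__main__":
    # Example: a noisy blackbox quadratic minimized with L-BFGS-B, using the
    # structured estimator in place of an analytic gradient.
    noise = np.random.default_rng(1)
    def noisy_objective(x):
        return 0.5 * np.dot(x - 1.0, x - 1.0) + 1e-5 * noise.standard_normal()

    x0 = np.zeros(8)
    res = minimize(noisy_objective, x0,
                   jac=lambda x: structured_gradient(noisy_objective, x),
                   method="L-BFGS-B")
    print(res.x)  # should approach the all-ones vector
```

Because the rows of M are orthonormal, each function evaluation contributes information about the full gradient rather than a single coordinate, which is what gives the structured scheme its robustness to evaluation noise relative to coordinate-wise finite differences.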
