Unbiased Gradient Estimation in Unrolled Computation Graphs with Persistent Evolution Strategies

Unrolled computation graphs arise in many scenarios, including training RNNs, tuning hyperparameters through unrolled optimization, and training learned optimizers. Current approaches to optimizing parameters in such computation graphs suffer from high-variance gradients, bias, slow updates, or large memory usage. We introduce a method called Persistent Evolution Strategies (PES), which divides the computation graph into a series of truncated unrolls and performs an evolution-strategies-based update step after each unroll. PES eliminates the bias introduced by these truncations by accumulating correction terms over the entire sequence of unrolls. PES allows for rapid parameter updates, has low memory usage, is unbiased, and has reasonable variance characteristics. We experimentally demonstrate the advantages of PES over several other gradient estimation methods on synthetic tasks, and show its applicability to training learned optimizers and tuning hyperparameters.
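
To make the estimator described above concrete, below is a minimal sketch of a PES-style update loop on a toy unrolled system. The inner dynamics, the function names (inner_step, truncated_unroll, pes_gradient), and the hyperparameter values are hypothetical illustrations rather than the paper's implementation; the essential ingredients are the persistent inner states carried across truncated unrolls, the antithetic perturbations applied at each unroll, and the accumulated perturbations (xis) that replace the per-unroll noise in the gradient estimate.

```python
import numpy as np

# Hypothetical toy inner problem; names and dynamics are illustrative only.

def inner_step(state, theta):
    """One step of the unrolled inner computation, parameterized by theta."""
    return 0.9 * state + theta

def truncated_unroll(state, theta, K):
    """Run K inner steps; return the updated state and the loss of this unroll."""
    loss = 0.0
    for _ in range(K):
        state = inner_step(state, theta)
        loss += float(np.sum(state ** 2))
    return state, loss

def pes_gradient(theta, pos_states, neg_states, xis, K, sigma=0.1):
    """One PES-style estimate: antithetic perturbations per particle, persistent
    inner states carried across unrolls, and accumulated perturbations (xis)
    used in place of the most recent noise to correct for truncation."""
    n = len(pos_states)
    grad = np.zeros_like(theta)
    for i in range(n):
        eps = sigma * np.random.randn(*theta.shape)
        pos_states[i], loss_pos = truncated_unroll(pos_states[i], theta + eps, K)
        neg_states[i], loss_neg = truncated_unroll(neg_states[i], theta - eps, K)
        xis[i] += eps  # accumulate this particle's perturbations over all unrolls so far
        grad += xis[i] * (loss_pos - loss_neg)
    return grad / (2 * n * sigma ** 2)

# Example usage: update theta online, one truncated unroll at a time.
theta = np.zeros(2)
n_particles, K = 8, 10
pos_states = [np.ones(2) for _ in range(n_particles)]
neg_states = [np.ones(2) for _ in range(n_particles)]
xis = [np.zeros(2) for _ in range(n_particles)]
for t in range(100):
    g = pes_gradient(theta, pos_states, neg_states, xis, K)
    theta -= 1e-3 * g
```

Because the accumulated noise, rather than only the most recent perturbation, multiplies each unroll's loss, the estimate accounts for how earlier perturbations influence later losses, which is what removes the bias introduced by truncation.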
