Meta-descent for Online, Continual Prediction

This paper investigates vector step-size adaptation approaches for non-stationary online, continual prediction problems. Vanilla stochastic gradient descent can be considerably improved by scaling the update with a vector of appropriately chosen step-sizes. Many methods, including AdaGrad, RMSProp, and AMSGrad, keep statistics about the learning process to approximate a second-order update---a vector approximation of the inverse Hessian. Another family of approaches uses meta-gradient descent to adapt the step-size parameters so as to minimize prediction error. These meta-descent strategies are promising for non-stationary problems, but have not been explored as extensively as quasi-second-order methods. We first derive a general, incremental meta-descent algorithm, called AdaGain, designed to be applicable to a much broader range of algorithms than previous meta-descent methods, including those with semi-gradient updates or even accelerations, such as RMSProp. We provide an empirical comparison of methods from both families. We conclude that methods from both families can perform well, but that in non-stationary prediction problems the meta-descent methods exhibit advantages. Our method is particularly robust across several prediction problems, and is competitive with the state-of-the-art method on a large-scale, time-series prediction problem using real data from a mobile robot.
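
As a concrete illustration of the meta-descent idea summarized above, the sketch below adapts a vector of per-weight step-sizes by gradient descent on the prediction error, in the spirit of IDBD/SMD-style methods. It is a minimal sketch under assumed details (log-parameterized step-sizes, a simple decayed trace in place of an exact sensitivity update, and a hypothetical grad_fn supplied by the caller); it is not the AdaGain update from the paper.

    import numpy as np

    def meta_descent_sgd(grad_fn, w, num_steps,
                         meta_rate=1e-2, init_step=1e-3, decay=0.9):
        # Per-weight step-sizes alpha[i] = exp(beta[i]); beta is adapted by
        # meta-gradient descent so step-sizes grow where successive gradients
        # agree in sign and shrink where they oscillate.
        beta = np.full_like(w, np.log(init_step))  # log step-sizes, one per weight
        h = np.zeros_like(w)                       # decayed trace of recent updates
        for _ in range(num_steps):
            g = grad_fn(w)             # stochastic gradient at w (assumed supplied)
            beta -= meta_rate * g * h  # meta step: correlate g with past update direction
            alpha = np.exp(beta)       # positive vector of step-sizes
            w = w - alpha * g          # base step: elementwise scaled SGD update
            h = decay * h - alpha * g  # crude approximation of dw/dbeta
        return w, np.exp(beta)

The log parameterization keeps every step-size positive without explicit clipping, and giving each weight its own step-size lets the update track drift in some weights while staying conservative in others, which is the property the abstract highlights for non-stationary prediction.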
