SPSA for Layer-Wise Training of Deep Networks

We investigate variants of the simultaneous perturbation stochastic approximation (SPSA) algorithm for training neural networks without backpropagation. Experimental results suggest that these variants can successfully train deep feed-forward networks using forward passes only. In particular, we find that SPSA-based algorithms that update network parameters layer by layer outperform variants that update all weights simultaneously.
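The abstract's core claim rests on the simultaneous perturbation gradient estimate (Spall, 1992): draw a Rademacher vector Delta (entries +-1), evaluate the loss at theta + c*Delta and theta - c*Delta, and estimate the gradient as g_hat = (L(theta + c*Delta) - L(theta - c*Delta)) / (2c) * Delta, which is valid elementwise because 1/Delta_i = Delta_i for +-1 entries. The layer-wise variant applies this estimate to one layer's weights at a time. Below is a minimal NumPy sketch of one such scheme; the toy network, the hyper-parameters a and c, and the fixed cyclic layer order are illustrative assumptions, not details taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network and data (forward passes only, no gradients).
W1 = rng.standard_normal((4, 8)) * 0.5
W2 = rng.standard_normal((8, 1)) * 0.5
X = rng.standard_normal((32, 4))
y = (X.sum(axis=1, keepdims=True) > 0).astype(float)

def forward_loss():
    """Mean squared error on (X, y); one forward pass, no backprop."""
    h = np.tanh(X @ W1)
    return float(np.mean((h @ W2 - y) ** 2))

def spsa_layerwise_step(layers, loss_fn, a=0.05, c=0.05):
    """One layer-wise SPSA step: perturb a single layer with a Rademacher
    vector, estimate that layer's gradient from two forward passes, and
    update it in place before moving on to the next layer."""
    for W in layers:
        delta = rng.choice([-1.0, 1.0], size=W.shape)  # Rademacher +-1
        W += c * delta
        loss_plus = loss_fn()        # L(theta + c*delta)
        W -= 2 * c * delta
        loss_minus = loss_fn()       # L(theta - c*delta)
        W += c * delta               # restore the original weights
        # SPSA gradient estimate; elementwise 1/delta equals delta here.
        g_hat = (loss_plus - loss_minus) / (2 * c) * delta
        W -= a * g_hat               # stochastic-approximation update

for step in range(200):
    spsa_layerwise_step([W1, W2], forward_loss)
print("final loss:", forward_loss())

Each layer update costs two forward passes, so a sweep over K layers costs 2K loss evaluations per step; the all-at-once SPSA variant the abstract compares against would instead perturb every layer simultaneously with a single pair of evaluations.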
