Training Neural Networks Using Predictor-Corrector Gradient Descent

We improve the training time of deep feedforward neural networks using a modified form of gradient descent that we call Predictor-Corrector Gradient Descent (PCGD). PCGD augments gradient descent with techniques inspired by predictor-corrector methods: it maintains a sparse history of network parameter values and periodically predicts future parameter values, skipping training iterations that would otherwise be needed. Compared with stochastic gradient descent (SGD), this method can nearly halve the number of training epochs required to reach a given test accuracy, and it can also outperform Nesterov's Accelerated Gradient (NAG), with some trade-offs.
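To make the idea concrete, the sketch below shows one plausible reading of such a loop: ordinary SGD acts as the corrector, parameter snapshots are recorded sparsely, and every so often a predictor extrapolates from the most recent snapshots to jump the parameters forward. The linear extrapolation, the toy least-squares loss, and all schedule constants (snapshot_every, predict_every, lookahead) are illustrative assumptions, not the authors' published algorithm or settings.

```python
# Minimal sketch of a predictor-corrector gradient-descent loop.
# ASSUMPTIONS: the predictor is a linear extrapolation over a sparse history
# of parameter snapshots; the corrector is plain SGD; the loss and all
# hyperparameters are placeholders chosen only for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem standing in for a network's training loss.
A = rng.normal(size=(200, 20))
x_true = rng.normal(size=20)
b = A @ x_true + 0.01 * rng.normal(size=200)

def grad(theta, batch):
    Ab, bb = A[batch], b[batch]
    return Ab.T @ (Ab @ theta - bb) / len(batch)

theta = np.zeros(20)
lr = 0.05
history = []            # sparse history of parameter snapshots
snapshot_every = 10     # record parameters every 10 SGD steps (assumed)
predict_every = 50      # attempt a prediction jump every 50 steps (assumed)
lookahead = 20          # extrapolate 20 "virtual" steps ahead (assumed)

for step in range(1, 501):
    batch = rng.choice(len(b), size=32, replace=False)
    theta = theta - lr * grad(theta, batch)      # corrector: plain SGD step

    if step % snapshot_every == 0:
        history.append((step, theta.copy()))
        history = history[-5:]                   # keep only a sparse history

    if step % predict_every == 0 and len(history) >= 2:
        # Predictor: linear extrapolation from the two most recent snapshots,
        # jumping ahead to skip iterations the corrector would otherwise take.
        (s0, t0), (s1, t1) = history[-2], history[-1]
        slope = (t1 - t0) / (s1 - s0)
        theta = t1 + slope * lookahead

loss = 0.5 * np.mean((A @ theta - b) ** 2)
print(f"final loss: {loss:.4f}")
```

In this reading, the prediction step trades a small risk of overshooting for the chance to skip corrector iterations; the subsequent SGD steps then pull the parameters back toward the true trajectory.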
