SPI-Optimizer: An Integral-Separated PI Controller for Stochastic Optimization

To overcome the oscillation problem of classical momentum-based optimizers, recent work relates them to the proportional-integral (PI) controller and artificially adds a derivative (D) term to obtain a PID controller; this suppresses oscillation, but at the cost of an extra hyper-parameter. In this paper, we show that the oscillation stems from the lag effect of the integral (I) term, and we propose SPI-Optimizer, an integral-separated PI controller based optimizer that introduces no extra hyper-parameter. It adaptively separates out the momentum (integral) term whenever the current gradient direction becomes inconsistent with the historical one. Extensive experiments demonstrate that SPI-Optimizer generalizes well across popular network architectures, eliminating the oscillation while achieving faster convergence (up to 40% fewer epochs) and more accurate classification on MNIST, CIFAR-10, and CIFAR-100 (up to 27.5% error reduction) than state-of-the-art methods.
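Restated as an update rule, the abstract's idea is a per-coordinate consistency check between the fresh gradient step (the P term) and the accumulated momentum (the I term). The sketch below is a minimal NumPy illustration of that reading; the function name `spi_step`, the default hyper-parameter values, and the choice to keep accumulating the momentum buffer even when the step is separated are assumptions for illustration, not taken from the paper's reference implementation.

```python
import numpy as np

def spi_step(theta, velocity, grad, lr=0.01, momentum=0.9):
    """One SPI-style update: use the momentum (integral) term only where the
    current gradient step agrees in sign with the accumulated history."""
    # Standard heavy-ball accumulation (the I term in the PI view).
    velocity = momentum * velocity - lr * grad

    # Per-coordinate consistency check: where the fresh gradient step and the
    # accumulated velocity point in opposite directions, fall back to the
    # plain gradient (P-only) step to avoid lag-induced oscillation.
    consistent = np.sign(-grad) == np.sign(velocity)
    update = np.where(consistent, velocity, -lr * grad)

    return theta + update, velocity
```

In a training loop, `theta` and `velocity` would be updated by calling `spi_step` on each fresh stochastic gradient; because the check is element-wise, the separation switches on and off independently for every parameter, which is what lets the method drop the lagging integral contribution without any additional hyper-parameter.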
