PID Controller-Based Stochastic Optimization Acceleration for Deep Neural Networks

Deep neural networks (DNNs) are widely used and have demonstrated their power in many applications, such as computer vision and pattern recognition. However, training these networks can be time-consuming, a problem that can be alleviated by using efficient optimizers. As one of the most commonly used optimizers, stochastic gradient descent with momentum (SGD-M) uses past and present gradients for parameter updates. However, during network training, SGD-M may suffer from drawbacks such as the overshoot phenomenon, which slows convergence. To alleviate this problem and accelerate the convergence of DNN optimization, we propose a proportional-integral-derivative (PID) approach. Specifically, we first investigate the intrinsic relationship between PID control and SGD-M. We then propose a PID-based optimization algorithm to update the network parameters, which exploits the past gradients, the current gradient, and the change of gradients. Consequently, the proposed PID-based optimizer alleviates the overshoot problem suffered by SGD-M. When tested on popular DNN architectures, it also achieves up to 50% acceleration with competitive accuracy. Extensive experiments on computer vision and natural language processing tasks demonstrate the effectiveness of our method on benchmark datasets, including CIFAR10, CIFAR100, Tiny-ImageNet, and PTB. We have released the code at https://github.com/tensorboy/PIDOptimizer.
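To make the idea concrete, below is a minimal sketch of a PID-style parameter update, assuming the proportional term acts on the current gradient, the integral term on an exponentially accumulated history of past gradients (as in momentum), and the derivative term on the change between consecutive gradients. The gains kp, ki, kd, the learning rate, the decay factor beta, and the toy quadratic objective are illustrative assumptions, not the paper's tuned settings or the released implementation.

```python
# Minimal sketch of a PID-style update rule (illustrative, not the paper's code).
import numpy as np

def pid_update(theta, grad, state, lr=0.1, kp=1.0, ki=0.5, kd=0.3, beta=0.5):
    """One PID-style step; `state` holds the integral buffer and last gradient."""
    v = state.get("v", np.zeros_like(theta))          # I: accumulated past gradients
    prev_grad = state.get("prev_grad", np.zeros_like(theta))
    v = beta * v + grad                                # integrate gradient history
    d = grad - prev_grad                               # D: change of gradient
    theta = theta - lr * (kp * grad + ki * v + kd * d) # P + I + D update
    state["v"], state["prev_grad"] = v, grad
    return theta, state

# Usage on a toy quadratic f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta, state = np.array([5.0, -3.0]), {}
for step in range(50):
    grad = theta                                       # gradient of the toy objective
    theta, state = pid_update(theta, grad, state)
print(theta)                                           # approaches the minimizer [0, 0]
```

With kd = 0, the update reduces to an SGD-M-style rule; the derivative term reacts to how fast the gradient is changing, which is the mechanism the paper uses to damp the overshoot of plain momentum.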
