A compensation-based optimization strategy for top dense layer training

Abstract The stochastic gradient descent (SGD) method plays a central role in training deep convolutional neural networks (DCNNs). Recent advances in optimization methods for DCNNs largely follow the direction of the gradient; the innovations mainly lie in adopting different techniques to manage the history of gradients or to adapt the step size automatically. In contrast, this paper proposes a novel optimization approach for training the top dense layer of a DCNN. Rather than following the gradient, it primarily drives parameter updates along the direction that points directly toward the optimal parameter values. The Moore-Penrose inverse is used to determine the difference between the current parameters and the optimal parameters, and the parameters are updated along this direction to compensate for that difference. The parameters are then fine-tuned along the classical gradient direction. Experiments conducted on a broad selection of benchmark datasets indicate that the proposed approach achieves a higher convergence rate and a lower minimum loss than other state-of-the-art optimization methods. Furthermore, with the same DCNN architectures, the performance margin between the proposed method and other state-of-the-art optimization approaches is significant.
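The abstract describes the two phases only at a high level. As a rough illustration, the NumPy sketch below shows one plausible reading, assuming the top dense layer is a linear map from extracted features to one-hot targets and that the "optimal" parameters are the least-squares solution obtained via the Moore-Penrose inverse; this is a minimal sketch, not the paper's algorithm, and all symbols (W, H, T, alpha, lr) are hypothetical.

```python
import numpy as np

# Hypothetical reading of the two phases described in the abstract:
# (1) a compensation step that moves the dense-layer weights toward the
#     Moore-Penrose least-squares solution, and
# (2) a fine-tuning step along the classical gradient direction.

def compensation_step(W, H, T, alpha=1.0):
    """Move W toward the pseudoinverse (least-squares) solution W_star."""
    W_star = np.linalg.pinv(H) @ T          # closed-form "optimal" weights
    return W + alpha * (W_star - W)         # compensate the gap to W_star

def gradient_finetune_step(W, H, T, lr=1e-3):
    """One plain SGD step on 0.5 * ||H W - T||^2 / n."""
    grad = H.T @ (H @ W - T) / H.shape[0]
    return W - lr * grad

# Toy usage: 128 samples, 64-dimensional features, 10 classes.
rng = np.random.default_rng(0)
H = rng.standard_normal((128, 64))          # features from the frozen base
T = np.eye(10)[rng.integers(0, 10, 128)]    # one-hot targets
W = rng.standard_normal((64, 10)) * 0.01    # current dense-layer weights
W = compensation_step(W, H, T)              # compensation phase
W = gradient_finetune_step(W, H, T)         # gradient fine-tuning phase
```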
