Optimizing Neural Networks in the Equivalent Class Space

It has been widely observed that many activation functions and pooling operations used in neural network models, including ReLU, PReLU, max-pooling, and average pooling, are invariant to positive rescaling, which makes fully-connected neural networks (FNNs) and convolutional neural networks (CNNs) invariant to positive rescaling operations across layers. This invariance causes non-negligible problems for optimization: (1) different NN models can be equivalent (i.e., represent the same function), yet their gradients can differ substantially; (2) it can be proven that the loss function may have many spurious critical points in the redundant weight space. To tackle these problems, in this paper we first characterize the rescaling-invariance of NN models using equivalent classes and prove that the dimension of the equivalent class space is significantly smaller than that of the original weight space. We then represent the loss function in this compact equivalent class space and develop novel algorithms that optimize NN models directly in that space. We call these algorithms Equivalent Class Optimization (abbreviated as EC-Opt) algorithms. Moreover, we design efficient techniques to compute gradients in the equivalent class space that incur almost no extra computational cost compared to standard back-propagation (BP). We conduct experimental studies to demonstrate the effectiveness of the proposed optimization algorithms. In particular, we show that by using the idea of EC-Opt, we can significantly improve the accuracy of the learned model (for both FNNs and CNNs) compared to conventional stochastic gradient descent algorithms.
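
To make the invariance concrete, here is a minimal NumPy sketch (illustrative only, not the paper's implementation; the network size, squared loss, and rescaling constant c are arbitrary choices). It shows that multiplying a hidden unit's incoming weights by c > 0 and dividing its outgoing weight by c leaves the output of a two-layer ReLU network unchanged, while the back-propagated gradients of the two equivalent parameterizations differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer ReLU network: y = W2 @ relu(W1 @ x).
# Positive inputs and first-layer weights keep every hidden unit active,
# so the gradient difference below is visible in all entries.
W1 = rng.uniform(0.5, 1.5, size=(4, 3))
W2 = rng.normal(size=(1, 4))
x = rng.uniform(0.5, 1.5, size=(3,))
target = np.zeros(1)

def forward(W1, W2, x):
    h = np.maximum(W1 @ x, 0.0)   # ReLU hidden activations
    return W2 @ h, h

def grads(W1, W2, x, target):
    """Gradients of the squared loss 0.5 * ||y - target||^2 w.r.t. W1, W2."""
    y, h = forward(W1, W2, x)
    err = y - target
    gW2 = np.outer(err, h)
    gW1 = np.outer((W2.T @ err) * (h > 0), x)   # (h > 0) is the ReLU derivative
    return gW1, gW2

# Rescale hidden unit 0: incoming weights * c, outgoing weight / c (c > 0).
c = 10.0
W1_r, W2_r = W1.copy(), W2.copy()
W1_r[0, :] *= c
W2_r[:, 0] /= c

# The two parameterizations define the same function...
print(np.allclose(forward(W1, W2, x)[0], forward(W1_r, W2_r, x)[0]))  # True
# ...but their gradients are different.
gW1, gW2 = grads(W1, W2, x, target)
gW1_r, gW2_r = grads(W1_r, W2_r, x, target)
print(np.allclose(gW2, gW2_r), np.allclose(gW1, gW1_r))               # False False
```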
