[1] Geoffrey E. Hinton, et al. On the importance of initialization and momentum in deep learning, 2013, ICML.
[2] Geoffrey E. Hinton, et al. Layer Normalization, 2016, arXiv.
[3] Yoshua Bengio, et al. Understanding the difficulty of training deep feedforward neural networks, 2010, AISTATS.
[4] Sami Abu-El-Haija, et al. Learning Edge Representations via Low-Rank Asymmetric Projections, 2017, CIKM.
[5] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[6] Geoffrey E. Hinton, et al. Learning representations by back-propagating errors, 1986, Nature.
[7] Patrick Seemann, et al. Matrix Factorization Techniques for Recommender Systems, 2014.
[8] Yang You, et al. Scaling SGD Batch Size to 32K for ImageNet Training, 2017, arXiv.
[9] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.
[10] Li Fei-Fei, et al. Detecting Events and Key Actors in Multi-person Videos, 2016, CVPR.
[11] Tim Salimans, et al. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks, 2016, NIPS.
[12] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.