GradMax: Growing Neural Networks using Gradient Information

The architecture and the parameters of neural networks are often optimized independently, which requires costly retraining of the parameters whenever the architecture is modified. In this work we instead focus on growing the architecture without requiring costly retraining. We present a method that adds new neurons during training without impacting what is already learned, while improving the training dynamics. We achieve the latter by maximizing the gradients of the new weights, and we find the optimal initialization efficiently by means of the singular value decomposition (SVD). We call this technique Gradient Maximizing Growth (GradMax) and demonstrate its effectiveness on a variety of vision tasks and architectures.
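
To make the growth step concrete, below is a minimal NumPy sketch of the idea as described in the abstract: new neurons get zero incoming weights, so the network's function is preserved, while their outgoing weights are taken from the top singular vectors of an aggregated gradient/activation matrix, which maximizes the gradient norm of the new incoming weights. This is an illustrative reconstruction, not the paper's reference implementation; the function name `gradmax_init`, the argument names, and the `scale` parameter are assumptions made for the example.

```python
import numpy as np

def gradmax_init(grad_next, acts_prev, k, scale=1.0):
    """Sketch of a GradMax-style growth step for a dense layer.

    grad_next : (batch, d_next) gradients of the loss w.r.t. the next
                layer's pre-activations.
    acts_prev : (batch, d_prev) activations feeding the grown layer.
    k         : number of neurons to add.
    scale     : norm budget for the new outgoing weights (assumed).
    """
    # Aggregate the training signal available to the new neurons.
    M = grad_next.T @ acts_prev                    # (d_next, d_prev)

    # The top-k left singular vectors of M maximize the gradient norm
    # of the new incoming weights under a norm constraint.
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    W_out_new = scale * U[:, :k]                   # (d_next, k) outgoing weights

    # Zero incoming weights keep the current function unchanged.
    W_in_new = np.zeros((k, acts_prev.shape[1]))   # (k, d_prev)
    return W_in_new, W_out_new
```

In this sketch the existing layers are left untouched: the returned blocks would be concatenated onto the current weight matrices, so the loss is identical immediately after growth but the new neurons start with large gradients and begin learning quickly.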
