Parallelizing Over Artificial Neural Network Training Runs with Multigrid

Artificial neural networks are a popular and effective machine learning technique. Great progress has been made parallelizing the expensive training phase of an individual network, leading to highly specialized hardware, much of it based on GPU-type architectures, and to more concurrent algorithms such as synthetic gradients. However, training remains a bottleneck because the training data must be processed serially over thousands of individual training runs. This work considers a multigrid reduction in time (MGRIT) algorithm that parallelizes over these thousands of training runs and converges to the same solution that traditional serial training would provide. MGRIT was originally developed to provide parallelism for time-evolution problems that step serially through a finite number of time steps. This work recasts neural network training in the same mold, treating training as an evolution equation that evolves the network weights from one step to the next. Thus, this work concerns distributed computing approaches for neural networks, but is distinct from other approaches that parallelize only within individual training runs. The work concludes with supporting numerical results for two model problems.
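The core idea of the abstract can be illustrated with a minimal sketch, not taken from the paper: treat each gradient-descent step as one "time step" of an evolution equation, and apply a two-level parareal-style iteration (the simplest instance of the MGRIT family) so that the expensive fine propagations across all steps become independent and parallelizable. The quadratic loss, step sizes, and function names below are illustrative assumptions, not the paper's setup.

```python
# Hypothetical sketch: training as time evolution, parallelized across
# training steps with a two-level parareal iteration (simplest MGRIT).

def grad(w):
    # Toy loss L(w) = 0.5 * (w - 3)^2, so grad L = w - 3; minimum at w = 3.
    return w - 3.0

def fine_step(w, h=0.1, substeps=10):
    # "Fine" propagator F: several small gradient-descent steps (expensive).
    for _ in range(substeps):
        w = w - h * grad(w)
    return w

def coarse_step(w, H=0.5):
    # "Coarse" propagator G: one large, cheap, inaccurate step.
    return w - H * grad(w)

def serial_train(w0, n_steps):
    # Reference: ordinary sequential training with the fine propagator.
    w = w0
    for _ in range(n_steps):
        w = fine_step(w)
    return w

def parareal_train(w0, n_steps, n_iters):
    # Parareal update: w[n+1] <- G(w_new[n]) + F(w_old[n]) - G(w_old[n]).
    # The F(w_old[n]) evaluations across all n are independent, so they
    # could run in parallel; only the cheap G sweep stays serial.
    w = [w0] * (n_steps + 1)
    for n in range(n_steps):            # initial guess: serial coarse sweep
        w[n + 1] = coarse_step(w[n])
    for _ in range(n_iters):
        fine_vals = [fine_step(w[n]) for n in range(n_steps)]    # parallelizable
        coarse_old = [coarse_step(w[n]) for n in range(n_steps)]
        new = [w0]
        for n in range(n_steps):
            new.append(coarse_step(new[n]) + fine_vals[n] - coarse_old[n])
        w = new
    return w[-1]
```

A standard property of this iteration is exactness after as many iterations as steps: `parareal_train(2.0, 6, 6)` reproduces `serial_train(2.0, 6)` to machine precision, matching the abstract's claim of converging to the same solution as traditional training; in practice far fewer iterations suffice, which is where the parallel speedup comes from.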
