Local Critic Training for Model-Parallel Learning of Deep Neural Networks

This paper proposes a novel approach to training deep neural networks in a parallelized manner by unlocking the layer-wise dependency of backpropagation. The approach employs additional modules, called local critic networks, alongside the main network to be trained; these modules estimate the output of the main network so that error gradients can be obtained without complete feedforward and backward propagation. We propose a cascaded learning strategy for these local networks that enables parallelized training of different layer groups. Experimental results show the effectiveness of the proposed approach and suggest guidelines for choosing appropriate algorithm parameters. In addition, we demonstrate that the approach can also be used for structural optimization of neural networks, computationally efficient progressive inference, and ensemble classification for improved performance.
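
To make the idea concrete, the following is a minimal sketch, assuming a PyTorch-style setup, of decoupled training with local critic networks and a cascaded critic objective. The names (LayerGroup, train_step), the choice of a linear critic, and the squared-error cascaded target are illustrative assumptions for this sketch, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerGroup(nn.Module):
    """One group of main-network layers plus its local critic."""
    def __init__(self, in_dim, out_dim, num_classes):
        super().__init__()
        self.layers = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())
        # Local critic: a small network mapping this group's activation to an
        # estimate of the main network's final output (here, class logits).
        self.critic = nn.Linear(out_dim, num_classes)

    def forward(self, x):
        return self.layers(x)

def train_step(groups, head, optimizers, x, y):
    """One decoupled update. Gradients never cross group boundaries: each group
    learns from its critic's estimated loss, and each critic is pulled toward
    the loss estimated one stage later (the last critic toward the true loss)."""
    h, est_losses = x, []
    for g in groups:
        h = g(h.detach())                      # detach: cut inter-group gradients
        est_losses.append(F.cross_entropy(g.critic(h), y))
    true_loss = F.cross_entropy(head(h.detach()), y)

    # Cascaded targets for the critics (treated as constants via detach()).
    targets = [l.detach() for l in est_losses[1:]] + [true_loss.detach()]

    for est, tgt, opt in zip(est_losses, targets, optimizers[:-1]):
        opt.zero_grad()
        # The group's weights follow the critic's estimated loss; the critic is
        # additionally trained to reproduce the next stage's loss (cascaded learning).
        (est + (est - tgt).pow(2)).backward()
        opt.step()

    optimizers[-1].zero_grad()
    true_loss.backward()                       # the output head uses the true loss
    optimizers[-1].step()
    return true_loss.item()

# Example usage: three layer groups plus an output head, one optimizer per part.
groups = nn.ModuleList([LayerGroup(32, 64, 10), LayerGroup(64, 64, 10), LayerGroup(64, 64, 10)])
head = nn.Linear(64, 10)
optimizers = [torch.optim.SGD(g.parameters(), lr=0.01) for g in groups]
optimizers.append(torch.optim.SGD(head.parameters(), lr=0.01))
loss = train_step(groups, head, optimizers, torch.randn(8, 32), torch.randint(0, 10, (8,)))
```

Because every backward pass in the loop touches only one group's parameters, the per-group updates are independent and could in principle be dispatched to separate devices, which is the property the cascaded strategy is meant to expose; this sketch runs them sequentially only for clarity.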
