Decoupled Greedy Learning of CNNs for Synchronous and Asynchronous Distributed Learning

A commonly cited inefficiency of neural network training with back-propagation is the update locking problem: each layer must wait for the error signal to propagate through the full network before it can update. Several alternatives that alleviate this issue have been proposed. In this context, we consider a simple alternative based on minimal feedback, which we call Decoupled Greedy Learning (DGL). It builds on a classic greedy relaxation of the joint training objective, recently shown to be effective for Convolutional Neural Networks (CNNs) on large-scale image classification. We consider an optimization of this objective that decouples the training of individual layers, allowing layers or modules of the network to be trained in parallel with a potentially linear speedup. Using a replay buffer, we show that this approach extends to asynchronous settings, where modules can continue to operate and update even under large communication delays. To address bandwidth and memory constraints, we propose an approach based on online vector quantization, which drastically reduces both the communication bandwidth between modules and the memory required by the replay buffers. We show theoretically and empirically that this approach converges, and we compare it to sequential solvers. We demonstrate the effectiveness of DGL against alternative approaches on the CIFAR-10 dataset and on the large-scale ImageNet dataset.
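To make the decoupling concrete, the following is a minimal sketch of per-module greedy training with replay buffers, written against a standard PyTorch API. The module architecture, auxiliary heads, buffer size, and optimizer settings are illustrative assumptions rather than the configuration used in the paper, and the online vector quantization of buffered activations is omitted; the sketch only shows how local losses and buffers remove the update-locking dependency between modules.

import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F


class DGLModule(nn.Module):
    # One greedily trained block: a small feature extractor plus an auxiliary
    # classifier that provides the local training signal (illustrative sizes).
    def __init__(self, in_ch, out_ch, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.aux_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(out_ch, num_classes)
        )

    def forward(self, x):
        h = self.features(x)
        return h, self.aux_head(h)


def train_dgl_epoch(modules, loader, buffer_size=64):
    # Each module optimizes only its own local loss. A replay buffer in front
    # of every module decouples it from its predecessor: a module trains on
    # whatever (possibly stale) activations are currently available.
    opts = [torch.optim.SGD(m.parameters(), lr=0.1, momentum=0.9) for m in modules]
    buffers = [deque(maxlen=buffer_size) for _ in modules]
    for x, y in loader:
        buffers[0].append((x, y))  # raw inputs feed the first module
        for j, (module, opt, buf) in enumerate(zip(modules, opts, buffers)):
            if not buf:
                continue  # this module has not received any activations yet
            h_in, y_b = random.choice(buf)  # sample a stored (input, label) pair
            h_out, logits = module(h_in)
            loss = F.cross_entropy(logits, y_b)
            opt.zero_grad()
            loss.backward()  # gradients stay inside the module
            opt.step()
            if j + 1 < len(modules):
                # forward the detached activation; no gradient crosses modules
                buffers[j + 1].append((h_out.detach(), y_b))


# Usage on random data in place of CIFAR-10-sized inputs:
modules = nn.ModuleList([DGLModule(3, 32), DGLModule(32, 64)])
dataset = torch.utils.data.TensorDataset(
    torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,))
)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
train_dgl_epoch(modules, loader)

Because each module reads from its own buffer and writes detached activations downstream, the inner loop over modules could in principle run as separate workers, which is the asynchronous setting described above.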
