Revisiting Locally Supervised Learning: an Alternative to End-to-end Training

Due to the need to store intermediate activations for back-propagation, end-to-end (E2E) training of deep networks usually suffers from a high GPU memory footprint. This paper aims to address this problem by revisiting locally supervised learning, where a network is split into gradient-isolated modules and trained with local supervision. We experimentally show that simply training local modules with the E2E loss tends to collapse task-relevant information at early layers, which hurts the performance of the full model. To avoid this issue, we propose an information propagation (InfoPro) loss, which encourages local modules to preserve as much useful information as possible while progressively discarding task-irrelevant information. Since the InfoPro loss is difficult to compute in its original form, we derive a feasible upper bound as a surrogate optimization objective, yielding a simple but effective algorithm. In fact, we show that the proposed method boils down to minimizing the combination of a reconstruction loss and a standard cross-entropy/contrastive term. Extensive empirical results on five datasets (CIFAR, SVHN, STL-10, ImageNet, and Cityscapes) validate that InfoPro achieves competitive performance with less than 40% of the memory footprint of E2E training, while allowing higher-resolution training data or larger batch sizes under the same GPU memory constraint. Our method also enables training local modules asynchronously for potential training acceleration. Code is available at: https://github.com/blackfeather-wang/InfoPro-Pytorch.
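To make the surrogate objective concrete, the sketch below illustrates (in PyTorch, which the released code also uses) how gradient-isolated local modules can each be trained with a combined reconstruction and cross-entropy loss, with activations detached between modules so no gradients cross module boundaries. This is a minimal illustration under assumed module structure and hyperparameters, not the authors' actual implementation; names such as `LocalModule`, `decoder`, `classifier`, and `lam` are illustrative.

```python
# Minimal sketch (not the authors' implementation) of locally supervised training:
# each gradient-isolated module minimizes a reconstruction loss (information
# preservation) plus a cross-entropy term (task-relevant signal).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalModule(nn.Module):
    """One backbone stage plus auxiliary heads for local supervision (illustrative)."""
    def __init__(self, in_ch, out_ch, num_classes):
        super().__init__()
        self.stage = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        )
        # Auxiliary decoder: reconstructs the input image from this stage's features.
        self.decoder = nn.Sequential(nn.Conv2d(out_ch, 3, 3, padding=1), nn.Sigmoid())
        # Auxiliary classifier: supplies the local cross-entropy signal.
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(out_ch, num_classes),
        )

    def forward(self, x):
        return self.stage(x)

    def local_loss(self, feat, image, target, lam=1.0):
        recon = F.mse_loss(self.decoder(feat), image)         # preserve input information
        ce = F.cross_entropy(self.classifier(feat), target)   # task-relevant information
        return recon + lam * ce

# Two gradient-isolated modules, each with its own optimizer (assumed setup).
modules = [LocalModule(3, 64, 10), LocalModule(64, 128, 10)]
opts = [torch.optim.SGD(m.parameters(), lr=0.1) for m in modules]

def train_step(image, target):
    x = image
    for m, opt in zip(modules, opts):
        feat = m(x)
        loss = m.local_loss(feat, image, target)
        opt.zero_grad()
        loss.backward()      # back-propagates only within the current module
        opt.step()
        x = feat.detach()    # block gradients: modules stay gradient-isolated
    return x
```

Because each module's activations can be discarded (or detached) once its local update is done, only one module's activations need to be kept in GPU memory at a time, which is the source of the memory savings described above; the same structure also allows modules to be updated asynchronously.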
