Reducing the Computational Burden of Deep Learning with Recursive Local Representation Alignment

Training deep neural networks on large-scale datasets requires significant hardware resources whose costs (even on cloud platforms) put them out of reach of smaller organizations, groups, and individuals. Backpropagation (backprop), the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize. Furthermore, it requires researchers to continually develop tricks, such as specialized weight initializations and activation functions, to ensure stable parameter optimization. Our goal is an effective, parallelizable alternative to backprop for training deep networks. In this paper, we propose a gradient-free learning procedure, recursive local representation alignment, for training large-scale neural architectures. Experiments with deep residual networks on CIFAR-10 and the massive-scale benchmark ImageNet show that our algorithm generalizes as well as backprop while converging sooner, owing to weight updates that are parallelizable and computationally less demanding. This is empirical evidence that a backprop-free algorithm can scale up to larger datasets. As a further contribution, we significantly reduce the total parameter count of our networks by using fast, fixed noise maps in place of convolutional operations, without compromising generalization.
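To make the notion of local, parallelizable weight updates concrete, below is a minimal NumPy sketch of a feedback-alignment-style local learning step in the spirit of local representation alignment. It is an illustration under stated assumptions, not the paper's recursive procedure or its residual-network setup: fixed random feedback matrices (rather than transposed forward weights) carry error signals downward, each layer is nudged toward a locally computed target, and every weight matrix is updated from quantities available at that layer alone. Names such as forward and local_step, the toy layer sizes, and the tanh activation are illustrative choices, not details from the paper.

import numpy as np

rng = np.random.default_rng(0)
sizes = [784, 256, 256, 10]   # toy MLP on flattened images (illustrative)
phi = np.tanh

# Forward weights W[l]: layer l -> layer l+1.
W = [rng.normal(0.0, 0.05, (sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
# Fixed random feedback matrices F[l]: project the error at layer l+2 down to layer l+1.
F = [rng.normal(0.0, 0.05, (sizes[l + 1], sizes[l + 2])) for l in range(len(sizes) - 2)]

def forward(x):
    # Forward pass; keep every layer's activity for the local updates.
    z = [x]
    for Wl in W:
        z.append(phi(Wl @ z[-1]))
    return z

def local_step(x, y_onehot, beta=0.1, lr=0.01):
    # One backprop-free step: form per-layer targets via the fixed feedback
    # matrices, then update each layer against its own local target.
    z = forward(x)
    e = [None] * len(z)
    e[-1] = z[-1] - y_onehot                  # task error at the output layer
    for k in range(len(z) - 2, 0, -1):        # carry the error downward
        e[k] = F[k - 1] @ e[k + 1]
    for l in range(len(W)):                   # independent per-layer updates
        t = z[l + 1] - beta * e[l + 1]        # local target for layer l+1
        delta = (z[l + 1] - t) * (1.0 - z[l + 1] ** 2)  # tanh derivative term
        W[l] -= lr * np.outer(delta, z[l])

# Example usage on a random input/label pair.
x = rng.normal(0.0, 1.0, sizes[0])
y = np.eye(sizes[-1])[3]
local_step(x, y)

Because no layer waits on a full backward sweep through the rest of the network, the per-layer updates in the final loop are independent of one another and could be computed concurrently, which is the kind of parallelism the abstract appeals to.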
