Gated Linear Networks

This paper presents a new family of backpropagation-free neural architectures, Gated Linear Networks (GLNs). What distinguishes GLNs from contemporary neural networks is the distributed and local nature of their credit assignment mechanism; each neuron directly predicts the target, forgoing the ability to learn feature representations in favor of rapid online learning. Individual neurons can model nonlinear functions via the use of data-dependent gating in conjunction with online convex optimization. We show that this architecture gives rise to universal learning capabilities in the limit, with effective model capacity increasing as a function of network size in a manner comparable to deep ReLU networks. Furthermore, we demonstrate that the GLN learning mechanism possesses extraordinary resilience to catastrophic forgetting, performing comparably to an MLP with dropout and Elastic Weight Consolidation on standard benchmarks. These desirable theoretical and empirical properties position GLNs as a complementary technique to contemporary offline deep learning methods.

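The abstract's description of a neuron that directly predicts the target, using data-dependent gating together with online convex optimization, can be made concrete with a small sketch. The Python below is a minimal illustration, assuming halfspace gating over the raw input and geometric mixing of incoming probabilities trained by online gradient descent on the log loss; the class name GLNNeuron, the gating scheme, the clipping bound, and the learning rate are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

class GLNNeuron:
    """Sketch of a single gated linear neuron: halfspace gating selects a
    weight vector, which geometrically mixes the incoming probabilities."""

    def __init__(self, n_inputs, n_contexts=4, side_dim=8, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        # One weight vector per gated region (2**n_contexts regions in total).
        self.weights = np.full((2 ** n_contexts, n_inputs), 1.0 / n_inputs)
        # Random hyperplanes over the raw (side) input define the gating.
        self.hyperplanes = rng.standard_normal((n_contexts, side_dim))
        self.biases = rng.standard_normal(n_contexts)
        self.lr = lr

    def _context(self, side_info):
        # Data-dependent gating: which side of each hyperplane the input lies on.
        bits = (self.hyperplanes @ side_info > self.biases).astype(int)
        return int(bits @ (2 ** np.arange(len(bits))))

    def predict(self, p_in, side_info):
        # Geometric mixing: sigmoid of a weighted sum of input log-odds.
        c = self._context(side_info)
        logits = np.log(p_in / (1.0 - p_in))
        p = 1.0 / (1.0 + np.exp(-self.weights[c] @ logits))
        return np.clip(p, 1e-4, 1 - 1e-4), c, logits

    def update(self, p_in, side_info, target):
        # Local, online convex step: log loss is convex in the active weights,
        # and only the weight vector of the selected context is updated.
        p, c, logits = self.predict(p_in, side_info)
        self.weights[c] -= self.lr * (p - target) * logits
        return p
```

In this sketch a full network would stack layers of such neurons, with each layer taking the previous layer's output probabilities as p_in and the raw input as side information, and every neuron updated locally toward the same binary target rather than via backpropagated gradients.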