The Ebb and Flow of Deep Learning: a Theory of Local Learning

In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning, where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms the alternatives and achieves the maximum possible capacity. The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.
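
The polynomial local rules discussed above include the familiar Hebbian family. As a minimal sketch (in Python with NumPy, not code from the paper), the following illustrates one such rule, Oja's variant of Hebbian learning, in which the weight update depends only on variables available locally at each synapse: the presynaptic input, the postsynaptic output, and the weight itself.

```python
import numpy as np

# Minimal sketch (illustrative, not from the paper): a polynomial local
# learning rule of the Hebbian family (Oja's rule) for a single linear neuron.
# The update uses only locally available quantities: the presynaptic input x,
# the postsynaptic output y, and the synaptic weight w.

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5)) @ rng.normal(size=(5, 5))  # correlated, zero-mean inputs
w = 0.1 * rng.normal(size=5)
eta = 0.01  # learning rate

for x in X:
    y = w @ x                    # postsynaptic activity of the linear neuron
    w += eta * y * (x - y * w)   # local polynomial update (Oja's rule)

# In the linear case this purely local, target-free rule drives w toward the
# principal eigenvector of the input covariance, illustrating the kind of
# representation deep local learning can extract without a backward channel.
print(w / np.linalg.norm(w))
```

By contrast, learning a complex input-output function would require feeding information about the targets back to such a synapse through a backward channel, as in backpropagation, rather than relying on forward-propagated local activity alone.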
