I-theory on depth vs width: hierarchical function composition
Tomaso Poggio, with Fabio Anselmi and Lorenzo Rosasco

Deep learning networks with convolution, pooling and subsampling are a special case of hierarchical architectures, which can be represented by trees (such as binary trees). Hierarchical as well as shallow networks can approximate functions of several variables, in particular those that are compositions of low-dimensional functions. We show that the power of a deep network architecture with respect to a shallow network is rather independent of the specific nonlinear operations in the network and depends instead on the behavior of the VC-dimension. A shallow network can approximate compositional functions with the same error as a deep network, but at the cost of a VC-dimension that is exponential rather than quadratic in the dimensionality of the function. To complete the argument, we argue that there exist visual computations that are intrinsically compositional; in particular, we prove that recognition invariant to translation cannot be computed by shallow networks in the presence of clutter. Finally, a general framework that includes the compositional case is sketched. The key condition that allows tall, thin networks to be nicer than short, fat networks is that the target input-output function must be sparse in a certain technical sense.

This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF 1231216.
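To make the depth-vs-width scaling concrete, here is a minimal sketch (not from the paper): it evaluates a binary-tree compositional function on d = 8 inputs built by repeatedly applying a single 2-variable constituent, then compares back-of-the-envelope unit counts for a hierarchical approximant (one small module per tree node) against a generic shallow approximant. The constituent h, the accuracy eps, and the eps^-2 vs eps^-d unit-count scalings are illustrative assumptions standing in for the paper's VC-dimension argument, not results taken from it.

```python
import math

def h(a, b):
    # Hypothetical smooth 2-variable constituent function (an assumption
    # for illustration; the paper does not fix a particular constituent).
    return math.tanh(a + 0.5 * b)

def compose_tree(xs):
    """Evaluate the binary-tree composition h(h(x1,x2), h(x3,x4)), ...
    bottom-up; assumes len(xs) is a power of two."""
    level = list(xs)
    while len(level) > 1:
        level = [h(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

d = 8                                   # input dimensionality
x = [0.1 * i for i in range(d)]
print("f(x) =", compose_tree(x))

# Back-of-the-envelope unit counts, assuming a 2-variable module needs
# O(eps^-2) units for accuracy eps, while a generic d-variable function
# needs O(eps^-d) units (curse of dimensionality).
eps = 0.1
units_per_module = round(eps ** -2)     # 100 units for one 2-d module
deep_units = (d - 1) * units_per_module # one module per internal tree node
shallow_units = round(eps ** -d)        # 10^8 units for the flat network
print("deep (hierarchical) units:", deep_units)
print("shallow (one hidden layer) units:", shallow_units)
```

Under these assumptions the hierarchical count grows linearly with d while the shallow count grows exponentially, mirroring the abstract's point that the gap is driven by how a complexity measure (here unit counts; in the paper, VC-dimension) scales with dimensionality rather than by the choice of nonlinearity.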
