Why Does Deep and Cheap Learning Work So Well?

We show how the success of deep learning could depend not only on mathematics but also on physics: although well-known mathematical theorems guarantee that neural networks can approximate arbitrary functions well, the class of functions of practical interest can frequently be approximated through “cheap learning” with exponentially fewer parameters than generic ones. We explore how properties frequently encountered in physics such as symmetry, locality, compositionality, and polynomial log-probability translate into exceptionally simple neural networks. We further argue that when the statistical process generating the data is of a certain hierarchical form prevalent in physics and machine learning, a deep neural network can be more efficient than a shallow one. We formalize these claims using information theory and discuss the relation to the renormalization group. We prove various “no-flattening theorems” showing when efficient linear deep networks cannot be accurately approximated by shallow ones without efficiency loss; for example, we show that $n$ variables cannot be multiplied using fewer than $2^n$ neurons in a single hidden layer.
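A concrete instance of such “cheap learning” highlighted in the paper is that the product of two inputs can be approximated arbitrarily well by a single hidden layer of just four neurons, provided the activation $\sigma$ satisfies $\sigma''(0) \neq 0$. The sketch below is a minimal numerical illustration of that construction, not code from the paper; it assumes a softplus activation (for which $\sigma''(0) = 1/4$) and an illustrative scaling factor `lam`, and shows the approximation error shrinking as the inputs are rescaled toward zero.

```python
import numpy as np

def softplus(x):
    """Smooth nonlinearity with nonzero second derivative at 0: sigma''(0) = 1/4."""
    return np.log1p(np.exp(x))

def approx_product(u, v, lam=0.01, sigma=softplus, sigma2_at_0=0.25):
    """Approximate u*v with a single hidden layer of 4 sigma-neurons.

    Taylor-expanding sigma around 0 gives
        sigma(x) + sigma(-x) = 2*sigma(0) + sigma''(0)*x**2 + O(x**4),
    so the combination below equals u*v up to O(lam**2) corrections.
    (lam and sigma are illustrative choices, not prescribed by the paper.)
    """
    s = lam * (u + v)
    d = lam * (u - v)
    # Four hidden units: sigma(+s), sigma(-s), sigma(+d), sigma(-d)
    hidden = sigma(s) + sigma(-s) - sigma(d) - sigma(-d)
    return hidden / (4.0 * sigma2_at_0 * lam**2)

if __name__ == "__main__":
    u, v = 2.0, 3.0
    for lam in (1.0, 0.1, 0.01):
        approx = approx_product(u, v, lam)
        print(f"lam={lam:5.2f}  approx={approx:.6f}  error={abs(approx - u*v):.2e}")
```

The error decays like $\mathcal{O}(\lambda^2)$: the odd Taylor terms cancel in $\sigma(x) + \sigma(-x)$, and the remaining quadratic terms combine as $(u+v)^2 - (u-v)^2 = 4uv$, which the prefactor $1/(4\sigma''(0)\lambda^2)$ converts back into the product.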
