An Exponential Improvement on the Memorization Capacity of Deep Threshold Networks

It is well known that modern deep neural networks are powerful enough to memorize datasets even when the labels have been randomized. Recently, Vershynin (2020) settled a long-standing question posed by Baum (1988), proving that deep threshold networks can memorize n points in d dimensions using Õ(e^{1/δ²} + √n) neurons and Õ(e^{1/δ²}(d + √n) + n) weights, where δ is the minimum distance between the points. In this work, we improve the dependence on δ from exponential to almost linear, proving that Õ(1/δ + √n) neurons and Õ(d/δ + n) weights are sufficient. Our construction uses Gaussian random weights only in the first layer, while all subsequent layers use binary or integer weights. We also prove new lower bounds by connecting memorization in neural networks to the purely geometric problem of separating n points on a sphere using hyperplanes.
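
The architecture class described above can be illustrated with a minimal sketch: a feedforward network with threshold (Heaviside) activations whose first layer has Gaussian random weights and whose later layers use binary weights. The layer widths, the NumPy setup, and the random ±1 sampling below are illustrative assumptions; this is not the paper's memorization construction, only a sketch of the weight structure the abstract describes.

```python
import numpy as np

def threshold(z):
    # Heaviside step activation: 1 if the pre-activation is positive, else 0.
    return (z > 0).astype(np.float64)

def build_network(d, widths, rng):
    """Sketch: Gaussian weights in the first layer, random ±1 weights afterwards."""
    layers = []
    fan_in = d
    for i, width in enumerate(widths):
        if i == 0:
            W = rng.normal(size=(width, fan_in))               # Gaussian first layer
        else:
            W = rng.choice([-1.0, 1.0], size=(width, fan_in))  # binary later layers
        b = np.zeros(width)                                    # biases, zero for simplicity
        layers.append((W, b))
        fan_in = width
    return layers

def forward(layers, x):
    # Apply each affine map followed by the threshold activation.
    h = x
    for W, b in layers:
        h = threshold(W @ h + b)
    return h

rng = np.random.default_rng(0)
net = build_network(d=10, widths=[32, 16, 1], rng=rng)
x = rng.normal(size=10)
print(forward(net, x))  # a single binary output
```

Restricting the later layers to binary or integer weights is the structural feature the abstract highlights; in this sketch it only changes how those weight matrices are sampled, while the Gaussian weights appear solely in the first layer.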

[1] Yoshua Bengio, et al. A Closer Look at Memorization in Deep Networks, 2017, ICML.

[2] Yih-Fang Huang, et al. Bounds on the number of hidden neurons in multilayer perceptrons, 1991, IEEE Trans. Neural Networks.

[3] Panos J. Antsaklis, et al. A simple method to derive bounds on the size and to train multilayer neural networks, 1991, IEEE Trans. Neural Networks.

[4] Guang-Bin Huang, et al. Learning capability and storage capacity of two-hidden-layer feedforward networks, 2003, IEEE Trans. Neural Networks.

[5] Roman Vershynin, et al. Memory Capacity of Neural Networks with Threshold and Rectified Linear Unit Activations, 2020, SIAM J. Math. Data Sci.

[6] Jinwoo Shin, et al. Provable Memorization via Deep Neural Networks using Sub-linear Parameters, 2020, COLT.

[7] Yann LeCun, et al. Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks, 2018, ArXiv.

[8] Adam Kowalczyk, et al. Estimates of Storage Capacity of Multilayer Perceptron with Threshold Logic Hidden Units, 1997, Neural Networks.

[9] Samy Bengio, et al. Understanding deep learning requires rethinking generalization, 2016, ICLR.

[10] Suvrit Sra, et al. Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity, 2018, NeurIPS.

[11] Eduardo D. Sontag, et al. Remarks on Interpolation and Recognition Using Neural Nets, 1990, NIPS.

[12] A. Batyuk, et al. Bithreshold Neural Network Classifier, 2020, IEEE 15th International Conference on Computer Sciences and Information Technologies (CSIT).

[13] Mikhail Belkin, et al. Two models of double descent for weak features, 2019, SIAM J. Math. Data Sci.

[14] R. Durbin, et al. Bounds on the learning capacity of some multi-layer networks, 1989, Biological Cybernetics.

[15] Ronen Eldan, et al. Network size and weights size for memorization with two-layers neural networks, 2020, ArXiv.

[16] O. Papaspiliopoulos. High-Dimensional Probability: An Introduction with Applications in Data Science, 2020.

[17] K. Ball. An Elementary Introduction to Modern Convex Geometry, 1997.

[18] Peter L. Bartlett, et al. The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights is More Important than the Size of the Network, 1998, IEEE Trans. Inf. Theory.

[19] Thomas M. Cover, et al. Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition, 1965, IEEE Trans. Electron. Comput.

[20] Eric B. Baum, et al. On the capabilities of multilayer perceptrons, 1988, J. Complex.

[21] Dimitris Achlioptas, et al. Bad Global Minima Exist and SGD Can Reach Them, 2019, NeurIPS.

[22] L. Gordon, et al. Tutorial on large deviations for the binomial distribution, 1989, Bulletin of Mathematical Biology.

[23] Tengyu Ma, et al. Identity Matters in Deep Learning, 2016, ICLR.