Approximation and Estimation for High-Dimensional Deep Learning Networks

It has been experimentally observed in recent years that multi-layer artificial neural networks have a surprising ability to generalize, even when trained with far more parameters than observations. Is there a theoretical basis for this? The best available bounds on their metric entropy and associated complexity measures are essentially linear in the number of parameters, which is inadequate to explain this phenomenon. Here we examine the statistical risk (mean squared predictive error) of multi-layer networks with $\ell^1$-type controls on their parameters and with ramp activation functions (also called lower-rectified linear units). In this setting, the risk is shown to be upper bounded by $[(L^3 \log d)/n]^{1/2}$, where $d$ is the input dimension to each layer, $L$ is the number of layers, and $n$ is the sample size. In this way, the input dimension can be much larger than the sample size and the estimator can still be accurate, provided that the target function has such $\ell^1$ controls and that the sample size is at least moderately large compared to $L^3 \log d$. The heart of the analysis is the development of a sampling strategy that demonstrates the accuracy of a sparse covering of deep ramp networks. Lower bounds show that the identified risk is close to optimal.
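
As a rough numerical illustration of the scaling in this bound (the values of $L$, $d$, and $n$ below are arbitrary, chosen only for the arithmetic and not taken from the analysis): with $L = 5$ layers, input dimension $d = 10^9$ to each layer, and sample size $n = 10^5$,
$$
\left(\frac{L^3 \log d}{n}\right)^{1/2}
= \left(\frac{5^3 \, \log(10^9)}{10^5}\right)^{1/2}
\approx \left(\frac{125 \times 20.7}{10^5}\right)^{1/2}
\approx 0.16,
$$
so the input dimension vastly exceeds the sample size, yet the risk bound remains small. Because the dependence on $d$ is only logarithmic, even increasing $d$ to $2 \times 10^9$ raises $\log d$ from about $20.7$ to about $21.4$ and leaves the bound essentially unchanged.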
