Full error analysis for the training of deep neural networks

Deep learning algorithms have in recent years been applied highly successfully to a range of problems that are out of reach for classical solution paradigms. Nevertheless, there is no completely rigorous mathematical error and convergence analysis which explains the success of deep learning algorithms. In many situations the error of a deep learning algorithm can be decomposed into three parts: the approximation error, the generalization error, and the optimization error. In this work we estimate each of these three errors for a certain deep learning algorithm and combine the three estimates to obtain an overall error analysis for the algorithm under consideration. In particular, we thereby establish convergence of the overall error with a suitable convergence speed. Our convergence speed analysis is far from optimal: the convergence speed we establish is rather slow, grows exponentially in the dimension, and, in particular, suffers from the curse of dimensionality. The main contribution of this work is instead to provide a full error analysis which (i) covers each of the three sources of error that typically arise in deep learning algorithms and (ii) merges these three sources of error into one overall error estimate for the considered deep learning algorithm.
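The three-way splitting referred to above can be made concrete by the following standard empirical-risk-minimization estimate; the notation is purely illustrative and not taken from the paper itself: $\mathcal{R}$ denotes the true risk, $\widehat{\mathcal{R}}_n$ the empirical risk based on $n$ samples, $\mathcal{F}$ the class of neural networks of the chosen architecture, $f^*$ a minimizer of $\mathcal{R}$ over all measurable functions, $\widehat{f}_n$ a minimizer of $\widehat{\mathcal{R}}_n$ over $\mathcal{F}$, and $\widetilde{f}$ the function actually computed by the training procedure.

% A sketch of the standard error splitting, under the illustrative
% assumption that training approximately minimizes the empirical risk
% over the network class \mathcal{F}.
\begin{align*}
\mathcal{R}(\widetilde{f}) - \mathcal{R}(f^*)
\;\le\; &\underbrace{\mathcal{R}(\widetilde{f}) - \mathcal{R}(\widehat{f}_n)}_{\text{optimization error}}
\;+\; \underbrace{2 \sup_{f \in \mathcal{F}} \bigl| \mathcal{R}(f) - \widehat{\mathcal{R}}_n(f) \bigr|}_{\text{generalization error}} \\
&\;+\; \underbrace{\inf_{f \in \mathcal{F}} \mathcal{R}(f) - \mathcal{R}(f^*)}_{\text{approximation error}}.
\end{align*}

Each term is then bounded by a different toolkit: the approximation error by expressivity results for the network class, the generalization error by concentration-of-measure arguments, and the optimization error by an analysis of the training scheme; the contribution of the work is to carry out all three bounds for one concrete algorithm and to merge them into a single overall estimate.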
