Deep Neural Networks Learn Non-Smooth Functions Effectively

We theoretically investigate why deep neural networks (DNNs) perform better than other models in some cases by studying statistical properties of DNNs for non-smooth functions. While DNNs have empirically shown higher performance than other standard methods, the mechanism behind this is still not well understood. From the viewpoint of statistical theory, it is known that many standard methods attain optimal convergence rates, and it has therefore been difficult to identify a theoretical advantage of DNNs. This paper fills the gap by considering the estimation of a certain class of non-smooth functions, which is not covered by the previous theory. We derive convergence rates of estimators obtained by DNNs with a ReLU activation and show that these estimators are almost optimal for estimating the non-smooth functions, while some popular models do not attain the optimal rate. In addition, our theoretical results provide guidelines for selecting an appropriate number of layers and edges of DNNs. We provide numerical experiments supporting the theoretical results.
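
To make the intuition concrete, the following minimal sketch (not the paper's construction; it assumes only NumPy and uses illustrative constants) shows how a pair of ReLU units can approximate a sharp jump in a piecewise smooth target, the kind of non-smooth structure that fixed-bandwidth smoothers blur:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def soft_step(x, a=100.0):
    """Approximate the indicator 1{x >= 0} with two ReLU units:
    relu(a*x) - relu(a*x - 1) rises linearly from 0 to 1 on [0, 1/a]."""
    return relu(a * x) - relu(a * x - 1.0)

# Piecewise smooth target: a smooth part plus a jump of height 2 at x = 0.5.
x = np.linspace(0.0, 1.0, 1001)
target = np.sin(2 * np.pi * x) + 2.0 * (x >= 0.5)

# ReLU-based approximation of the jump component alone.
approx_jump = 2.0 * soft_step(x - 0.5, a=1000.0)

# Sup-norm error of the jump approximation outside the O(1/a) transition band.
mask = np.abs(x - 0.5) > 1e-3
print("max error outside transition band:",
      np.max(np.abs(approx_jump[mask] - 2.0 * (x[mask] >= 0.5))))
```

The transition band shrinks as the slope parameter grows, so the ReLU representation can track a discontinuity with only a constant number of extra units, whereas a smooth basis must trade off bias near the jump against variance elsewhere.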
