Nonparametric regression using deep neural networks with ReLU activation function

The author is very grateful to the discussants for sharing their viewpoints on the article. The discussant contributions highlight gaps in the current theoretical understanding and outline many possible directions for future research in this area. The rejoinder is organized by topic. Throughout, [GMMM], [K], [KL], and [S] refer to the discussant contributions by Ghorbani et al., Kutyniok, Kohler and Langer, and Shamir, respectively.
