Sharp Rate of Convergence for Deep Neural Network Classifiers under the Teacher-Student Setting

Classifiers built with neural networks handle large-scale, high-dimensional data, such as facial images in computer vision, extremely well, while traditional statistical methods often fail. In this paper, we attempt to understand this empirical success in high-dimensional classification by deriving convergence rates for the excess risk. In particular, we propose a teacher-student framework in which the Bayes classifier can be expressed as a ReLU neural network. Under this setup, we obtain a sharp rate of convergence, namely $\tilde{O}_d(n^{-2/3})$, for classifiers trained with either the 0-1 loss or the hinge loss, where $n$ denotes the sample size. This rate further improves to $\tilde{O}_d(n^{-1})$ when the data distribution is separable. An interesting observation is that the data dimension $d$ enters these rates only through the $\log(n)$ term. This may offer one theoretical explanation for the empirical success of deep neural networks in high-dimensional classification, particularly for structured data.
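
To make the teacher-student setting concrete, the sketch below (not taken from the paper; the architectures, widths, sample sizes, and optimizer are all illustrative assumptions) draws labels from the sign of a fixed random ReLU "teacher" network, so the Bayes classifier is itself a ReLU network and the data are separable, and then fits a ReLU "student" by minimizing the hinge loss.

```python
# Minimal teacher-student sketch: teacher sign = Bayes classifier,
# student trained with hinge loss; all hyperparameters are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n_train, n_test = 50, 2000, 10000   # dimension and sample sizes (arbitrary)

def relu_net(depth, width, in_dim):
    """A plain fully connected ReLU network with scalar output."""
    layers, prev = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(prev, width), nn.ReLU()]
        prev = width
    layers += [nn.Linear(prev, 1)]
    return nn.Sequential(*layers)

# Teacher: its sign is the Bayes classifier (noiseless, hence separable data).
teacher = relu_net(depth=2, width=16, in_dim=d)
for p in teacher.parameters():
    p.requires_grad_(False)

def sample(n):
    x = torch.rand(n, d)                    # features uniform on [0,1]^d
    y = torch.sign(teacher(x)).squeeze(1)   # labels in {-1, +1}
    y[y == 0] = 1.0
    return x, y

x_tr, y_tr = sample(n_train)
x_te, y_te = sample(n_test)

student = relu_net(depth=2, width=64, in_dim=d)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for epoch in range(200):
    opt.zero_grad()
    margin = y_tr * student(x_tr).squeeze(1)
    loss = torch.clamp(1.0 - margin, min=0.0).mean()   # hinge loss
    loss.backward()
    opt.step()

with torch.no_grad():
    err = (torch.sign(student(x_te)).squeeze(1) != y_te).float().mean()
print(f"test 0-1 error (equals excess risk here, since Bayes error is 0): {err:.4f}")
```

Because the teacher is noiseless, the Bayes risk is zero and the measured test 0-1 error coincides with the excess risk; flipping a fraction of the labels would move the experiment to the non-separable regime studied under the $\tilde{O}_d(n^{-2/3})$ rate.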
