Global Minima of DNNs: The Plenty Pantry

A common strategy for training deep neural networks (DNNs) is to use very large architectures and to train them until they (almost) achieve zero training error. Empirically, such models often generalize well on test data, even in the presence of substantial label noise, which seems to corroborate this procedure. On the other hand, statistical learning theory tells us that over-fitting can lead to poor generalization, for example when empirical risk minimization (ERM) is performed over a hypothesis class that is too large. Motivated by this apparent contradiction, so-called interpolation methods have recently received much attention, and some local averaging schemes with zero training error have been shown to be consistent and even optimal. However, no theoretical analysis of interpolating ERM-like methods exists so far. We take a step in this direction by showing that, for certain large hypothesis classes, some interpolating ERMs enjoy very good statistical guarantees while others fail in the worst possible sense. Moreover, we show that the same phenomenon occurs for DNNs with zero training error and sufficiently large architectures.
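To make the setting concrete, the following is a minimal formalization of an interpolating ERM; the notation ($\mathcal{F}$ for the hypothesis class, $\ell$ for the loss, $(x_i, y_i)$ for the training sample) is illustrative and not taken from the paper itself.

% A predictor \hat{f}_n is an empirical risk minimizer (ERM) over \mathcal{F}
% if it minimizes the average training loss; it is an *interpolating* ERM
% if this minimum equals zero, i.e. it fits every training point exactly.
\[
  \hat{f}_n \in \operatorname*{arg\,min}_{f \in \mathcal{F}}
    \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(y_i, f(x_i)\bigr),
  \qquad
  \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(y_i, \hat{f}_n(x_i)\bigr) = 0.
\]

For the DNNs discussed above, $\mathcal{F}$ is the set of functions realizable by a fixed, sufficiently large architecture, and "zero training error" is exactly this interpolation condition; the claim of the abstract is that different interpolating choices of $\hat{f}_n$ within such an $\mathcal{F}$ can behave very differently on unseen data.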
