Towards Understanding Generalization via Analytical Learning Theory

This paper introduces a novel measure-theoretic theory for machine learning that does not require statistical assumptions. Based on this theory, a new regularization method for deep learning is derived and shown to outperform previous methods on CIFAR-10, CIFAR-100, and SVHN. Moreover, the proposed theory provides a theoretical basis for a family of practically successful regularization methods in deep learning. We discuss several consequences of our results for one-shot learning, representation learning, deep learning, and curriculum learning. Unlike statistical learning theory, the proposed theory analyzes each problem instance individually via measure theory, rather than analyzing a set of problem instances via statistics. As a result, it provides different types of results and insights than statistical learning theory does.
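To make the instance-wise, non-statistical flavor of such bounds concrete, the following is a minimal illustrative sketch in LaTeX of the classical Koksma-Hlawka inequality from quasi-Monte Carlo theory, which bounds the error of one fixed sample set deterministically. It is offered only as an example of this style of per-instance bound, not as the paper's actual result; the symbols $V_{\mathrm{HK}}$ and $D^{*}$ below denote the standard Hardy-Krause variation and star discrepancy, not quantities defined by the paper.

% Illustrative only: the classical Koksma-Hlawka inequality, a deterministic,
% per-dataset bound shown as an example of analyzing a single problem instance
% rather than an expectation over random samples. Not the paper's own bound.
\[
  \left| \int_{[0,1]^d} f(x)\, dx \;-\; \frac{1}{m} \sum_{i=1}^{m} f(x_i) \right|
  \;\le\; V_{\mathrm{HK}}[f] \cdot D^{*}\!\big(x_1, \dots, x_m\big)
\]
% Here $V_{\mathrm{HK}}[f]$ is the variation of $f$ in the sense of Hardy-Krause
% and $D^{*}(x_1, \dots, x_m)$ is the star discrepancy of the fixed point set.
% Both quantities are properties of the single given function and dataset, with
% no probabilistic assumptions on how the points were generated.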
