Towards Understanding Generalization via Analytical Learning Theory

This paper introduces a novel measure-theoretic theory for machine learning that does not require statistical assumptions. Based on this theory, a new regularization method for deep learning is derived and shown to outperform previous methods on CIFAR-10, CIFAR-100, and SVHN. Moreover, the proposed theory provides a theoretical basis for a family of practically successful regularization methods in deep learning. We discuss several consequences of our results for one-shot learning, representation learning, deep learning, and curriculum learning. Unlike statistical learning theory, the proposed theory analyzes each problem instance individually via measure theory, rather than analyzing a set of problem instances via statistics. As a result, it provides different types of results and insights than statistical learning theory does.
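To make the instance-wise, non-statistical flavor of such bounds concrete, the following is a minimal illustrative sketch in LaTeX of the classical Koksma-Hlawka inequality from quasi-Monte Carlo theory, which bounds the error of one fixed sample set deterministically. It is offered only as an example of this style of per-instance bound, not as the paper's actual result; the symbols $V_{\mathrm{HK}}$ and $D^{*}$ below denote the standard Hardy-Krause variation and star discrepancy, not quantities defined by the paper.

% Illustrative only: the classical Koksma-Hlawka inequality, a deterministic,
% per-dataset bound shown as an example of analyzing a single problem instance
% rather than an expectation over random samples. Not the paper's own bound.
\[
  \left| \int_{[0,1]^d} f(x)\, dx \;-\; \frac{1}{m} \sum_{i=1}^{m} f(x_i) \right|
  \;\le\; V_{\mathrm{HK}}[f] \cdot D^{*}\!\big(x_1, \dots, x_m\big)
\]
% Here $V_{\mathrm{HK}}[f]$ is the variation of $f$ in the sense of Hardy-Krause
% and $D^{*}(x_1, \dots, x_m)$ is the star discrepancy of the fixed point set.
% Both quantities are properties of the single given function and dataset, with
% no probabilistic assumptions on how the points were generated.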
