A theory of universal learning

How quickly can a given class of concepts be learned from examples? It is common to measure the performance of a supervised machine learning algorithm by plotting its "learning curve", that is, the decay of the error rate as a function of the number of training examples. However, the classical theoretical framework for understanding learnability, the PAC model of Vapnik-Chervonenkis and Valiant, does not explain the behavior of learning curves: the distribution-free PAC model can only bound the upper envelope of the learning curves over all possible data distributions. This does not match the practice of machine learning, where the data source is typically fixed in any given scenario, while the learner may choose the number of training examples on the basis of factors such as computational resources and desired accuracy. In this paper, we study an alternative learning model that better captures such practical aspects of machine learning, but still gives rise to a complete theory of the learnable in the spirit of the PAC model. More precisely, we consider the problem of universal learning, which aims to understand the performance of learning algorithms on every data distribution, but without requiring uniformity over the distribution. The main result of this paper is a remarkable trichotomy: there are only three possible rates of universal learning. Specifically, we show that the learning curves of any given concept class decay at either an exponential, a linear, or an arbitrarily slow rate. Moreover, each of these cases is completely characterized by appropriate combinatorial parameters, and we exhibit optimal learning algorithms that achieve the best possible rate in each case. For concreteness, we consider in this paper only the realizable case, though analogous results are expected to extend to more general learning scenarios.
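To make the notion of a universal rate concrete, the following is a minimal sketch, in our own notation rather than the paper's, of how such rates can be formalized; the distribution-dependent constants $C_P$ and $c_P$ and the exact shape of the definitions are assumptions made for illustration only.

\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}

% A hedged sketch of universal learning rates (notation ours, not the paper's).
% P is a distribution over X x {0,1} that is realizable by the concept class H,
% i.e. inf_{h in H} err_P(h) = 0, where err_P(h) = P(h(X) != Y).

A learning algorithm maps a training sample
$S_n = ((X_1,Y_1),\dots,(X_n,Y_n)) \sim P^n$ to a hypothesis $\hat h_n$,
and its \emph{learning curve} under $P$ is
\[
  \mathcal{E}_P(n) \;=\; \mathbb{E}\!\left[\operatorname{err}_P(\hat h_n)\right].
\]
We say the class $\mathcal{H}$ is learnable at rate $R(n)$ if some algorithm
satisfies, for \emph{every} realizable $P$,
\[
  \mathcal{E}_P(n) \;\le\; C_P\, R(c_P\, n) \qquad \text{for all } n,
\]
where $C_P, c_P > 0$ may depend on $P$ but not on $n$; note the absence of
uniformity over $P$, in contrast with the distribution-free PAC guarantee.
The trichotomy then says that for every concept class the optimal rate is one
of three: exponential, $R(n) = e^{-n}$; linear, $R(n) = 1/n$; or
\emph{arbitrarily slow}, meaning that for every vanishing sequence
$R(n) \to 0$ there is a realizable $P$ with $\mathcal{E}_P(n) \ge R(n)$ for
infinitely many $n$.

\end{document}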
