Exact Lower Bounds for the Agnostic Probably-Approximately-Correct (PAC) Machine Learning Model

We provide an exact non-asymptotic lower bound on the minimax expected excess risk (EER) in the agnostic probably-approximately-correct (PAC) machine learning classification model and identify minimax learning algorithms as certain maximally symmetric and minimally randomized "voting" procedures. Based on this result, an exact asymptotic lower bound on the minimax EER is provided. This bound is of the simple form $c_\infty/\sqrt{\nu}$ as $\nu\to\infty$, where $c_\infty=0.16997\dots$ is a universal constant, $\nu=m/d$, $m$ is the size of the training sample, and $d$ is the Vapnik--Chervonenkis dimension of the hypothesis class. It is shown that the differences between these asymptotic and non-asymptotic bounds, as well as the differences between these two bounds and the maximum EER over all learning algorithms that minimize the empirical risk, are asymptotically negligible, and all these differences are due to ties in the mentioned "voting" procedures. A few easy-to-compute non-asymptotic lower bounds on the minimax EER are also obtained; they are shown to be close to the exact asymptotic lower bound $c_\infty/\sqrt{\nu}$ even for rather small values of the ratio $\nu=m/d$. As an application of these results, we substantially improve existing lower bounds on the tail probability of the excess risk. Among the tools used are Bayes estimation and apparently new identities and inequalities for binomial distributions.
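To make the scaling of the asymptotic bound concrete, the following minimal sketch evaluates $c_\infty/\sqrt{\nu}$ with $\nu=m/d$ for a few training-sample sizes at a fixed VC dimension; the helper name `asymptotic_eer_lower_bound` and the example values of $m$ and $d$ are illustrative choices, and the constant is truncated to the digits quoted above.

```python
import math

# Universal constant c_inf = 0.16997..., truncated to the digits quoted above.
C_INF = 0.16997

def asymptotic_eer_lower_bound(m: int, d: int) -> float:
    """Evaluate the asymptotic minimax EER lower bound c_inf / sqrt(nu),
    where nu = m / d (illustrative helper, not from the paper)."""
    nu = m / d
    return C_INF / math.sqrt(nu)

# Example: fix the VC dimension d = 10 and let the sample size m grow;
# the bound decays at the rate 1 / sqrt(m / d).
for m in (100, 1_000, 10_000):
    print(f"m = {m:>6}, bound = {asymptotic_eer_lower_bound(m, d=10):.5f}")
```

As the loop shows, increasing $m$ tenfold at fixed $d$ shrinks the bound by a factor of $\sqrt{10}$, consistent with the $c_\infty/\sqrt{\nu}$ rate stated above.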
