The importance of convexity in learning with squared loss

We show that if the closure of a function class F under the metric induced by some probability distribution is not convex, then the sample complexity for agnostically learning F with squared loss (using only hypotheses in F) is Ω((1/ε²) ln(1/δ)), where 1 − δ is the probability of success and ε is the required accuracy. In comparison, if the class F is convex and has finite pseudodimension, then the sample complexity is O((1/ε)(ln(1/ε) + ln(1/δ))). If a nonconvex class F has finite pseudodimension, then the sample complexity for agnostically learning the closure of the convex hull of F is O((1/ε)(ln(1/ε) + ln(1/δ))). Hence, for agnostic learning, learning the convex hull provides better approximation capabilities with little sample complexity penalty.
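For quick comparison, the three rates from the abstract can be set side by side. This is a sketch in asymptotic notation only: the symbol m(ε, δ) for the sample complexity, and the suppression of constant factors and of the dependence on the pseudodimension, are assumptions made here for readability, not notation taken from the paper.

```latex
% Summary of the sample-complexity bounds stated in the abstract.
% m(\varepsilon,\delta): number of samples for accuracy \varepsilon
% and confidence 1-\delta. Constants and the pseudodimension
% dependence are suppressed, as in the abstract (assumption).
\begin{align*}
  \text{closure of } F \text{ not convex:} \quad
    & m(\varepsilon,\delta) = \Omega\!\left(\tfrac{1}{\varepsilon^{2}}\ln\tfrac{1}{\delta}\right) \\
  F \text{ convex, finite pseudodimension:} \quad
    & m(\varepsilon,\delta) = O\!\left(\tfrac{1}{\varepsilon}\bigl(\ln\tfrac{1}{\varepsilon}+\ln\tfrac{1}{\delta}\bigr)\right) \\
  \text{convex hull of nonconvex } F \text{:} \quad
    & m(\varepsilon,\delta) = O\!\left(\tfrac{1}{\varepsilon}\bigl(\ln\tfrac{1}{\varepsilon}+\ln\tfrac{1}{\delta}\bigr)\right)
\end{align*}
```

The contrast is the point of the paper: moving from a nonconvex class to its convex hull improves the achievable rate from Ω(1/ε²) to O((1/ε) ln(1/ε)) up to confidence terms, while only enlarging the hypothesis space.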
