How Local Should a Learning Method Be?

We consider the question of why modern machine learning methods such as support vector machines outperform earlier nonparametric techniques such as k-nearest neighbors (k-NN). Our approach investigates the locality of learning methods, i.e., their tendency to rely mainly on the nearby portion of the training set when forming a prediction at a given location. We show that, on the one hand, we can expect every consistent learning method to be local in some sense; hence, if we regard consistency as a desirable property, a degree of locality is unavoidable. On the other hand, we also argue that earlier methods such as k-NN are local in a stricter sense, which implies performance limitations. Thus, a degree of locality is necessary, but it should not be overdone. Support vector machines and related techniques strike a good balance in this respect, which we suggest may partly explain their strong performance in practice.
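To make the notion of locality concrete, here is a minimal sketch (our own illustration, not code from the paper; the function names and toy setup are assumptions) contrasting a strictly local predictor with a kernel-machine-style predictor. In k-NN, training points outside the k nearest have exactly zero influence on the prediction; in an SVM-style decision function with a Gaussian kernel, every training point contributes, but distant points are smoothly down-weighted rather than cut off.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    """Strictly local: the prediction at x depends only on the k nearest
    training points; all other points have exactly zero influence."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.sign(y_train[nearest].sum())

def kernel_machine_predict(X_train, alphas, x, gamma=1.0):
    """Softly local: an SVM-style decision function sums over all training
    points (alphas fold in the labels, as in sign(sum_i alpha_i K(x_i, x))),
    but the Gaussian kernel down-weights distant points smoothly."""
    k_vals = np.exp(-gamma * np.linalg.norm(X_train - x, axis=1) ** 2)
    return np.sign(alphas @ k_vals)
```

The contrast is the hard cutoff: changing a far-away training point can never change `knn_predict`, whereas it always has some (possibly negligible) effect on `kernel_machine_predict`, which is one way of reading the paper's claim that SVMs are local only in a weaker sense.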
