Optimal aggregation of classifiers in statistical learning

Classification can be considered as nonparametric estimation of sets, where the risk is defined via a distance between sets associated with the misclassification error. It is shown that the rates of convergence of classifiers depend on two parameters: the complexity of the class of candidate sets and the margin parameter. The dependence is given explicitly, showing that optimal fast rates approaching $O(n^{-1})$ can be attained, where n is the sample size, and that the proposed classifiers are robust with respect to the margin. The main result of the paper concerns optimal aggregation of classifiers: we suggest a classifier that automatically adapts both to the complexity and to the margin, and attains the optimal fast rates up to a logarithmic factor.
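
As a reading aid, here is a minimal sketch of how the two parameters enter, stated in the notation commonly used for this setting; the symbols $\eta$, $G^*$, $\kappa$, $\rho$ and the constants below are assumptions of this sketch, not quoted from the abstract. Write $\eta(x) = P(Y = 1 \mid X = x)$ for the regression function and $G^* = \{x : \eta(x) \ge 1/2\}$ for the Bayes set. The margin assumption with parameter $\kappa \ge 1$ reads
\[
  P_X\bigl( 0 < |\eta(X) - \tfrac{1}{2}| \le t \bigr) \;\le\; c\, t^{1/(\kappa - 1)}
  \qquad \text{for all } t > 0,
\]
and the complexity assumption bounds the $\varepsilon$-entropy of the class of candidate sets by $A\,\varepsilon^{-\rho}$ for some $\rho > 0$. Under assumptions of this type, the excess risk $R(\hat G_n) - R(G^*)$ of an empirical risk minimizer is of order
\[
  n^{-\kappa/(2\kappa + \rho - 1)},
\]
which approaches the fast rate $n^{-1}$ as $\kappa \to 1$ and $\rho \to 0$.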
