Fast learning rates in statistical inference through aggregation

We develop minimax optimal risk bounds for the general learning task of predicting as well as the best function in a reference set G, up to the smallest possible additive term, called the convergence rate. When the reference set is finite and n denotes the size of the training data, we provide minimax convergence rates of the form C([log |G|]/n)^v, with a tight evaluation of the positive constant C and with the exact exponent v in ]0,1], the latter depending on the convexity of the loss function and on the level of noise in the output distribution. The risk upper bounds are based on a sequential randomized algorithm which, at each step, concentrates on functions having both low risk and low variance with respect to the prediction function of the previous step. Our analysis highlights the links between the probabilistic and worst-case viewpoints, and yields risk bounds that are unachievable with the standard statistical learning approach. One of the key ideas of this work is to use probabilistic inequalities with respect to appropriate (Gibbs) distributions on the space of prediction functions, instead of using them with respect to the distribution generating the data. The risk lower bounds are based on refinements of Assouad's lemma that take into account, in particular, the properties of the loss function. Our key example to illustrate the upper and lower bounds is the L_q-regression setting, for which an exhaustive analysis of the convergence rates is given as q ranges over [1, +∞[.
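To make the aggregation mechanism behind the upper bounds concrete, here is a minimal Python sketch of the classical progressive mixture rule (exponentially weighted, or Gibbs, aggregation) over a finite reference set: weights proportional to exp(-eta × cumulative loss), averaged over the steps of a sequential pass through the data. This is only an illustration of the generic scheme that the paper's algorithm refines; the additional variance control with respect to the previous step's prediction function is not reproduced here, and the function name `progressive_mixture`, the temperature parameter `eta`, and the squared-loss toy example are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def progressive_mixture(predictions, y, loss, eta):
    """
    Minimal exponentially weighted ("Gibbs") aggregation over a finite
    reference set, with progressive (Cesaro) averaging of the weights.

    predictions : array of shape (|G|, n) -- predictions of each reference
                  function on the n training inputs, in time order.
    y           : array of shape (n,)     -- observed outputs.
    loss        : callable(pred, obs) -> nonnegative loss value.
    eta         : temperature of the Gibbs distribution.

    Returns the step-averaged weight vector; the aggregated predictor is
    the corresponding convex combination of the reference functions
    (meaningful when the loss is convex in the prediction).
    """
    m, n = predictions.shape
    cum_loss = np.zeros(m)      # cumulative loss of each reference function
    avg_weights = np.zeros(m)
    for t in range(n):
        # Gibbs weights built from the losses of the previous steps only
        w = np.exp(-eta * (cum_loss - cum_loss.min()))
        w /= w.sum()
        avg_weights += w / n
        # update cumulative losses with the t-th observation
        cum_loss += np.array([loss(predictions[k, t], y[t]) for k in range(m)])
    return avg_weights

# Toy usage: three constant predictors under squared loss; most of the
# weight should end up on the predictor closest to the true mean 0.3.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = 0.3 + 0.1 * rng.standard_normal(50)
    preds = np.vstack([np.full(50, c) for c in (0.0, 0.3, 1.0)])
    w = progressive_mixture(preds, y, lambda p, o: (p - o) ** 2, eta=2.0)
    print(w)
```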
