Fast rates in statistical and online learning

The speed with which a learning algorithm converges as it is presented with more data is a central problem in machine learning -- a fast rate of convergence means less data is needed for the same level of performance. The pursuit of fast rates in online and statistical learning has led to the discovery of many conditions in learning theory under which fast learning is possible. We show that most of these conditions are special cases of a single, unifying condition that comes in two forms: the central condition for 'proper' learning algorithms that always output a hypothesis in the given model, and stochastic mixability for online algorithms that may make predictions outside of the model. We show that under surprisingly weak assumptions both conditions are, in a certain sense, equivalent. The central condition has a re-interpretation in terms of convexity of a set of pseudoprobabilities, linking it to density estimation under misspecification. For bounded losses, we show how the central condition enables a direct proof of fast rates and we prove its equivalence to the Bernstein condition, itself a generalization of the Tsybakov margin condition; both have played a central role in obtaining fast rates in statistical learning. Yet, while the Bernstein condition is two-sided, the central condition is one-sided, making it more suitable for dealing with unbounded losses. In its stochastic mixability form, our condition generalizes both a stochastic exp-concavity condition identified by Juditsky, Rigollet and Tsybakov and Vovk's notion of mixability. Our unifying conditions thus provide a substantial step towards a characterization of fast rates in statistical learning, similar to how classical mixability characterizes constant regret in the sequential prediction-with-expert-advice setting.
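
For concreteness, the two conditions that play the leading role above are commonly stated as follows. This is a sketch in generic notation (a loss $\ell$, model $\mathcal{F}$, data distribution $P$, and risk minimizer $f^* \in \mathcal{F}$ are assumed); the paper's own definitions also cover weaker and parameterized variants.

\[
\text{$\eta$-central condition ($\eta > 0$):}\qquad
\mathbb{E}_{Z \sim P}\!\left[ e^{\eta\,(\ell_{f^*}(Z) - \ell_f(Z))} \right] \;\le\; 1
\quad\text{for all } f \in \mathcal{F},
\]
\[
\text{$(\beta, B)$-Bernstein condition ($\beta \in (0,1]$, $B > 0$):}\qquad
\mathbb{E}\!\left[ (\ell_f(Z) - \ell_{f^*}(Z))^2 \right] \;\le\; B \Bigl( \mathbb{E}\!\left[ \ell_f(Z) - \ell_{f^*}(Z) \right] \Bigr)^{\beta}
\quad\text{for all } f \in \mathcal{F}.
\]

The first condition only controls an exponential moment of the excess loss in one direction, whereas the second bounds a squared, hence two-sided, quantity; this is the asymmetry the abstract refers to when discussing unbounded losses.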

[1] Yu. V. Prokhorov. Convergence of Random Processes and Limit Theorems in Probability Theory, 1956.

[2] H. Richter. Parameterfreie Abschätzung und Realisierung von Erwartungswerten, 1957.

[3] W. J. Studden et al. Tchebycheff Systems: With Applications in Analysis and Statistics, 1967.

[4] Gerald S. Rogers et al. Mathematical Statistics: A Decision Theoretic Approach, 1967.

[5] J. Kemperman. The General Moment Problem, A Geometric Approach, 1968.

[6] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities, 1971.

[7] D. Freedman. On Tail Probabilities for Martingales, 1975.

[8] V. Vapnik et al. Necessary and Sufficient Conditions for the Uniform Convergence of Means to their Expectations, 1982.

[9] A. Barron. Are Bayes Rules Consistent in Information?, 1987.

[10] Thomas M. Cover et al. Open Problems in Communication and Computation, 2011, Springer New York.

[11] Vladimir Vovk et al. Aggregating strategies, 1990, COLT '90.

[12] Andrew R. Barron et al. Minimum complexity density estimation, 1991, IEEE Trans. Inf. Theory.

[13] Vladimir Vovk et al. A game of prediction with expert advice, 1995, COLT '95.

[14] Peter L. Bartlett et al. The importance of convexity in learning with squared loss, 1998, COLT '96.

[15] Peter L. Bartlett et al. Efficient agnostic learning of neural networks with bounded fan-in, 1996, IEEE Trans. Inf. Theory.

[16] Jon A. Wellner et al. Weak Convergence and Empirical Processes: With Applications to Statistics, 1996.

[17] Mathukumalli Vidyasagar et al. A Theory of Learning and Generalization, 1997.

[18] Vladimir Vapnik et al. Statistical learning theory, 1998.

[19] Peter L. Bartlett et al. The Importance of Convexity in Learning with Squared Loss, 1998, IEEE Trans. Inf. Theory.

[20] Yuhong Yang et al. Information-theoretic determination of minimax rates of convergence, 1999.

[21] A. Barron et al. Estimation of mixture models, 1999.

[22] Manfred K. Warmuth et al. Averaging Expert Predictions, 1999, EuroCOLT.

[23] Peter Grünwald. Viewing all models as "probabilistic", 1999, COLT '99.

[24] E. Mammen et al. Smooth Discrimination Analysis, 1999.

[25] A. van der Vaart et al. Convergence rates of posterior distributions, 2000.

[26] V. Vovk. Competitive On-line Statistics, 2001.

[27] Shahar Mendelson et al. Agnostic Learning Nonconvex Function Classes, 2002, COLT.

[28] Mathukumalli Vidyasagar et al. Learning and Generalization: With Applications to Neural Networks, 2002.

[29] A. Tsybakov et al. Optimal aggregation of classifiers in statistical learning, 2003.

[30] Mark Braverman et al. Learnability and automatizability, 2004, 45th Annual IEEE Symposium on Foundations of Computer Science.

[31] Claudio Gentile et al. On the generalization ability of on-line learning algorithms, 2001, IEEE Transactions on Information Theory.

[32] A. Dawid et al. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory, 2004, math/0410076.

[33] Tong Zhang. From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation, 2006, math/0702653.

[34] P. Bartlett et al. Empirical minimization, 2006.

[35] Michael I. Jordan et al. Convexity, Classification, and Risk Bounds, 2006.

[36] Tong Zhang et al. Information-theoretic upper and lower bounds for statistical estimation, 2006, IEEE Transactions on Information Theory.

[37] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization, 2006, 0708.0083.

[38] Gábor Lugosi et al. Prediction, learning, and games, 2006.

[39] A. van der Vaart et al. Misspecification in infinite-dimensional Bayesian statistics, 2006, math/0607023.

[40] O. Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, 2007, 0712.0248.

[41] Jean-Yves Audibert et al. Progressive mixture rules are deviation suboptimal, 2007, NIPS.

[42] Y. Singer et al. Logarithmic Regret Algorithms for Strongly Convex Repeated Games, 2007.

[43] P. Massart et al. Risk bounds for statistical learning, 2007, math/0702683.

[44] Elad Hazan et al. Logarithmic regret algorithms for online convex optimization, 2006, Machine Learning.

[45] Peter L. Bartlett et al. Adaptive Online Gradient Descent, 2007, NIPS.

[46] P. Grünwald. The Minimum Description Length Principle (Adaptive Computation and Machine Learning), 2007.

[47] A. Juditsky et al. Learning by mirror averaging, 2005, math/0511468.

[48] Shahar Mendelson et al. Lower Bounds for the Empirical Minimization Algorithm, 2008, IEEE Transactions on Information Theory.

[49] J. Rissanen et al. That Simple Device Already Used by Gauss, 2008.

[50] Vladimir Vovk et al. Prediction with expert advice for the Brier game, 2007, ICML '08.

[51] Shahar Mendelson et al. Obtaining fast error rates in nonconvex situations, 2008, J. Complex.

[52] Jean-Yves Audibert. Fast learning rates in statistical inference through aggregation, 2007, math/0703854.

[53] Vladimir Vovk et al. Supermartingales in prediction with expert advice, 2008, Theor. Comput. Sci.

[54] Jorma Rissanen et al. Minimum Description Length Principle, 2010, Encyclopedia of Machine Learning.

[55] Peter Grünwald et al. Safe Learning: bridging the gap between Bayes, MDL and statistical learning theory via empirical convexity, 2011, COLT.

[56] P. Bartlett et al. Margin-adaptive model selection in statistical learning, 2008, 0804.2937.

[57] Mark D. Reid et al. Mixability is Bayes Risk Curvature Relative to Log Loss, 2011, COLT.

[58] R. Bass. Convergence of probability measures, 2011.

[59] Shai Ben-David et al. Multiclass Learnability and the ERM principle, 2011, COLT.

[60] Guillaume Lecué. Interplay between concentration, complexity and geometry in learning theory with applications to high dimensional data analysis, 2011.

[61] Mark D. Reid et al. Mixability in Statistical Learning, 2012, NIPS.

[62] Peter Grünwald et al. The Safe Bayesian - Learning the Learning Rate via the Mixability Gap, 2012, ALT.

[63] Arnak S. Dalalyan et al. Mirror averaging with sparsity priors, 2010, 1003.1189.

[64] S. Walker et al. Bayesian asymptotics with misspecified models, 2013.

[65] R. Ramamoorthi et al. On Posterior Concentration in Misspecified Models, 2013, 1312.4620.

[66] Tim van Erven et al. From Exp-concavity to Mixability, 2013.

[67] Gábor Lugosi et al. Concentration Inequalities: A Nonasymptotic Theory of Independence, 2013.

[68] Robert C. Williamson et al. From Stochastic Mixability to Fast Rates, 2014, NIPS.

[69] Shai Ben-David et al. The sample complexity of agnostic learning under deterministic labels, 2014, COLT.

[70] Wouter M. Koolen et al. Follow the leader if you can, hedge if you must, 2013, J. Mach. Learn. Res.

[71] Shahar Mendelson et al. Learning without Concentration, 2014, COLT.

[72] Wouter M. Koolen et al. Learning the Learning Rate for Prediction with Expert Advice, 2014, NIPS.

[73] Thijs van Ommen et al. Inconsistency of Bayesian Inference for Misspecified Linear Models, and a Proposal for Repairing It, 2014, 1412.3730.

[74] Xinhua Zhang et al. Exp-Concavity of Proper Composite Losses, 2015, COLT.

[75] Mark D. Reid et al. Composite Multiclass Losses, 2011, J. Mach. Learn. Res.