Fast Rates with Unbounded Losses

We present new excess risk bounds for randomized and deterministic estimators under general unbounded loss functions, including log loss and squared loss. Our bounds are expressed in terms of the information complexity and hold under the recently introduced $v$-central condition, which allows for high-probability bounds, and under its weakening, the $v$-pseudoprobability convexity condition, which allows for bounds in expectation even under heavy-tailed distributions. The parameter $v$ determines the achievable rate and plays a role akin to the exponents in the Tsybakov margin condition and the Bernstein condition for bounded losses, both of which the $v$-conditions generalize; a favorable $v$ combined with small information complexity yields $\tilde{O}(1/n)$ rates. While these fast-rate conditions control the lower tail of the excess loss, the upper tail is controlled by a new type of witness-of-badness condition, which allows us to connect the excess risk to a generalized R\'enyi divergence, generalizing previous results connecting Hellinger distance to KL divergence.
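For orientation, here is a brief sketch of the central notions as they appear in the fast-rates literature (Van Erven, Gr\"unwald, Mehta, Reid, and Williamson, 2015; van Erven and Harremo\"es, 2014); the notation below ($\ell_f(Z)$ for the loss of predictor $f$ on outcome $Z$, $f^*$ for a comparator, $\mathcal{F}$ for the model class) is generic, and the precise formulation and constants used in this paper may differ. For $\eta > 0$, a learning problem satisfies the $\eta$-central condition if there is an $f^* \in \mathcal{F}$ such that, for all $f \in \mathcal{F}$,
$$\mathbf{E}_{Z \sim P}\!\left[ e^{-\eta\,(\ell_f(Z) - \ell_{f^*}(Z))} \right] \le 1 .$$
For a function $v : (0,\infty) \to (0,\infty)$, the $v$-central condition requires that for every $\epsilon > 0$ this holds up to slack $\epsilon$ with $\eta = v(\epsilon)$:
$$\mathbf{E}_{Z \sim P}\!\left[ e^{-v(\epsilon)\,(\ell_f(Z) - \ell_{f^*}(Z))} \right] \le e^{v(\epsilon)\,\epsilon} \quad \text{for all } f \in \mathcal{F} .$$
The R\'enyi divergence of order $\alpha \in (0,1)$ between distributions $P$ and $Q$ with densities $p, q$ is
$$D_\alpha(P \,\|\, Q) = \frac{1}{\alpha - 1} \log \int p^{\alpha} q^{1-\alpha} \, d\mu ,$$
which at $\alpha = 1/2$ is a monotone transformation of the squared Hellinger distance and recovers the KL divergence in the limit $\alpha \to 1$.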
