The Safe Bayesian - Learning the Learning Rate via the Mixability Gap

Standard Bayesian inference can behave suboptimally if the model is wrong. We present a modification of Bayesian inference which continues to achieve good rates with wrong models. Our method adapts the Bayesian learning rate to the data, picking the rate minimizing the cumulative loss of sequential prediction by posterior randomization. Our results can also be used to adapt the learning rate in a PAC-Bayesian context. The results are based on an extension of an inequality due to T. Zhang and others to dependent random variables.
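The learning-rate selection described above can be illustrated with a minimal sketch. It is not the paper's algorithm as published but a toy version under stated assumptions: a finite set of Bernoulli models with a uniform prior, log loss, a generalized posterior in which the likelihood is raised to the power eta, and a user-supplied grid of candidate rates. The function name `safe_bayes_eta` and its parameters are hypothetical. For each candidate eta, the sketch accumulates the expected loss of predicting each outcome with a model drawn from the eta-posterior computed on the preceding outcomes, then returns the eta with the smallest cumulative loss.

```python
import math


def safe_bayes_eta(data, thetas, etas):
    """Pick the learning rate with the smallest cumulative posterior-randomized log loss.

    Assumptions (illustrative only): binary outcomes in `data`, Bernoulli
    models with parameters `thetas` strictly inside (0, 1), a uniform prior,
    and a finite grid `etas` of candidate learning rates.
    """
    n_models = len(thetas)
    log_prior = [-math.log(n_models)] * n_models
    losses = {}
    for eta in etas:
        cum_log_lik = [0.0] * n_models  # log-likelihood of the past data per model
        total = 0.0
        for x in data:
            # Generalized posterior: prior times likelihood^eta, normalized.
            log_w = [lp + eta * ll for lp, ll in zip(log_prior, cum_log_lik)]
            m = max(log_w)
            w = [math.exp(v - m) for v in log_w]
            z = sum(w)
            w = [v / z for v in w]
            # Log loss each model would incur on the next outcome.
            step = [-math.log(th if x == 1 else 1.0 - th) for th in thetas]
            # Expected loss of predicting with a model drawn from the posterior.
            total += sum(wi * li for wi, li in zip(w, step))
            # Update the per-model log-likelihoods with the new outcome.
            cum_log_lik = [c - l for c, l in zip(cum_log_lik, step)]
        losses[eta] = total
    best = min(etas, key=lambda e: losses[e])
    return best, losses


if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 1, 1, 1]
    best, losses = safe_bayes_eta(data, thetas=[0.2, 0.5, 0.8], etas=[1.0, 0.5, 0.25])
    print("selected eta:", best)
```

With eta = 1 the weights coincide with the standard Bayesian posterior, and smaller values of eta make the posterior respond more slowly to the data; the grid of rates used here is arbitrary.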
