Bayesian Counterfactual Risk Minimization

We present a Bayesian view of counterfactual risk minimization (CRM) for offline learning from logged bandit feedback. Using PAC-Bayesian analysis, we derive a new generalization bound for the truncated inverse propensity score estimator. We apply the bound to a class of Bayesian policies, which motivates a novel, potentially data-dependent, regularization technique for CRM. Experimental results indicate that this technique outperforms standard $L_2$ regularization, and that it is competitive with variance regularization while being both simpler to implement and more computationally efficient.
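To make the central estimator concrete, the following is a minimal sketch of a truncated inverse propensity score (IPS) estimate of a policy's expected reward from logged bandit feedback. It assumes one common form of truncation, in which each logged propensity is clipped from below at a threshold `tau` before reweighting; the exact definition analyzed in the paper may differ. The function name `truncated_ips`, its arguments, and the default `tau` are illustrative assumptions, not taken from the source.

```python
import numpy as np


def truncated_ips(rewards, target_probs, logged_propensities, tau=0.05):
    """Truncated IPS estimate of a target policy's expected reward.

    Assumed form (hypothetical, for illustration only):
        (1/n) * sum_i r_i * pi(a_i | x_i) / max(p_i, tau)

    rewards             -- observed rewards r_i for the logged actions
    target_probs        -- pi(a_i | x_i), target policy's probability of the logged action
    logged_propensities -- p_i, logging policy's probability of the logged action
    tau                 -- truncation threshold in (0, 1]
    """
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must lie in (0, 1]")
    rewards = np.asarray(rewards, dtype=float)
    target_probs = np.asarray(target_probs, dtype=float)
    logged = np.asarray(logged_propensities, dtype=float)

    # Clipping small propensities bounds each importance weight by 1 / tau,
    # trading some bias for lower variance.
    weights = target_probs / np.maximum(logged, tau)
    return float(np.mean(rewards * weights))


if __name__ == "__main__":
    estimate = truncated_ips(
        rewards=[1.0, 0.0, 1.0],
        target_probs=[0.5, 0.2, 0.1],
        logged_propensities=[0.25, 0.5, 0.01],
        tau=0.1,
    )
    print(estimate)
```

In the usage example, the third logged propensity (0.01) falls below `tau`, so its importance weight is capped at 0.1 / 0.1 = 1 instead of 10. Without truncation, that single sample would dominate the estimate.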
