
Bayesian Counterfactual Risk Minimization

We present a Bayesian view of counterfactual risk minimization (CRM) for offline learning from logged bandit feedback. Using PAC-Bayesian analysis, we derive a new generalization bound for the truncated inverse propensity score estimator. We apply the bound to a class of Bayesian policies, which motivates a novel, potentially data-dependent, regularization technique for CRM. Experimental results indicate that this technique outperforms standard $L_2$ regularization, and that it is competitive with variance regularization while being both simpler to implement and more computationally efficient.
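For readers unfamiliar with the truncated inverse propensity score estimator named in the abstract, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes one common truncation convention, clipping the logged propensity from below at a threshold `tau`, and the function name, argument names, and toy numbers are all hypothetical.

```python
def truncated_ips_value(rewards, logging_propensities, target_probs, tau=0.05):
    """Estimate a target policy's average reward from logged bandit data.

    rewards[i]              -- observed reward for the logged action (assumed in [0, 1])
    logging_propensities[i] -- probability the logging policy assigned to that action
    target_probs[i]         -- probability the target policy assigns to that action
    tau                     -- truncation threshold in (0, 1]; the default is arbitrary
    """
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must be in (0, 1]")
    n = len(rewards)
    if n == 0 or not (n == len(logging_propensities) == len(target_probs)):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for r, p, q in zip(rewards, logging_propensities, target_probs):
        # Assumed convention: clip the propensity from below, so that no
        # importance weight q / max(p, tau) can exceed 1 / tau.
        total += r * q / max(p, tau)
    return total / n


# Toy log of three interactions (made-up numbers, for illustration only).
rewards = [1.0, 0.0, 1.0]
logging_propensities = [0.5, 0.25, 0.01]
target_probs = [0.5, 0.5, 0.1]
estimate = truncated_ips_value(rewards, logging_propensities, target_probs, tau=0.1)
```

In the toy log, the third interaction has a logged propensity of 0.01, so without truncation its importance weight would be 10. Clipping at `tau=0.1` caps that weight at 1, trading some bias for lower variance.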

Ben London | Ted Sandler
