Surrogate Objectives for Batch Policy Optimization in One-step Decision Making

We investigate batch policy optimization for cost-sensitive classification and contextual bandits, two related tasks that obviate exploration but require generalizing from observed rewards to action selection in unseen contexts. When rewards are fully observed, we show that the expected reward objective exhibits suboptimal plateaus and, in the worst case, exponentially many local optima. To overcome this poor landscape, we develop a convex surrogate that is calibrated with respect to entropy-regularized expected reward. We then consider the partially observed case, where rewards are recorded for only a subset of actions. Here we generalize the surrogate to partially observed data and uncover novel objectives for batch contextual bandit training. We show that the surrogate objectives remain provably sound in this setting and empirically demonstrate state-of-the-art performance.
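For concreteness, the following is an illustrative sketch in standard notation (the symbols x for context, a for action, r(x, a) for reward, and temperature \tau are our own shorthand, not taken from the paper) of the two objectives the abstract contrasts: the expected reward of a stochastic policy \pi_\theta, and its entropy-regularized variant, against which the convex surrogate is said to be calibrated.

\[
J(\theta) = \mathbb{E}_{x}\Big[\sum_{a} \pi_\theta(a \mid x)\, r(x, a)\Big],
\qquad
J_\tau(\theta) = \mathbb{E}_{x}\Big[\sum_{a} \pi_\theta(a \mid x)\, r(x, a) + \tau\, \mathcal{H}\big(\pi_\theta(\cdot \mid x)\big)\Big],
\qquad
\mathcal{H}(\pi) = -\sum_{a} \pi(a) \log \pi(a).
\]

For a fixed context x, the entropy-regularized objective is maximized by the softmax policy \pi^{*}(a \mid x) \propto \exp\big(r(x, a)/\tau\big), attaining the value \tau \log \sum_{a} \exp\big(r(x, a)/\tau\big), which is a smooth, convex function of the reward vector. In the partially observed setting only the reward of the logged action is available, so objectives of this form must first be estimated from logged data, for example with importance-weighted or doubly robust estimators from the off-policy evaluation literature.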
