Imitation-Regularized Offline Learning

We study the problem of offline learning in automated decision systems under the contextual-bandits model. We are given logged historical data consisting of contexts, (randomized) actions, and (nonnegative) rewards. A common goal is to evaluate what would have happened if different actions had been taken in the same contexts, so that the action policy can be optimized accordingly. The typical approach to this problem, inverse probability weighted estimation (IPWE) [Bottou et al., 2013], requires logged action probabilities, which may be missing in practice due to engineering complications. Even when they are available, small action probabilities cause large uncertainty in IPWE, rendering the corresponding results statistically insignificant. To address both problems, we show how to use policy improvement (PIL) objectives regularized by policy imitation (IML). We motivate and analyze PIL as an extension of clipped IPWE, by showing that both are lower-bound surrogates of the vanilla IPWE. We also formally connect IML to IPWE variance estimation [Swaminathan and Joachims, 2015] and to natural policy gradients. Without probability logging, our PIL-IML interpretation justifies, and improves via reward-weighting, the state-of-the-art cross-entropy (CE) loss that predicts the logged actions among all action candidates available in the same contexts. With probability logging, our main theoretical contribution connects IML underfitting to the existence of either confounding variables or model misspecification. We demonstrate the value and accuracy of our insights through simulations based on Simpson's paradox, standard UCI multiclass-to-bandit conversions, and the Criteo counterfactual analysis challenge dataset.
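To fix notation, here is a minimal numpy sketch of the estimators named above, assuming logged tuples of nonnegative rewards r_i and logging probabilities p_i = mu(a_i | x_i), plus target-policy probabilities pi(a_i | x_i). The clipping constant, the trade-off weight `lam`, and the PIL-IML objective written below are illustrative assumptions, not the paper's exact formulation or hyperparameters.

```python
import numpy as np

def ipwe(rewards, logged_probs, target_probs):
    """Vanilla inverse-probability-weighted estimate of the target policy's value."""
    weights = target_probs / logged_probs
    return np.mean(weights * rewards)

def clipped_ipwe(rewards, logged_probs, target_probs, clip=10.0):
    """Clipped IPWE: capping the importance weights trades a downward bias for
    lower variance; with nonnegative rewards it lower-bounds the vanilla IPWE."""
    weights = np.minimum(target_probs / logged_probs, clip)
    return np.mean(weights * rewards)

def pil_iml_objective(log_target_probs, rewards, logged_probs, lam=1.0):
    """Sketch of an imitation-regularized policy-improvement objective:
    a reward-weighted log-likelihood of the logged actions (PIL) plus an
    unweighted imitation term (IML) that keeps the learned policy close to
    the logging policy. Maximize over the policy parameters."""
    pil = np.mean((rewards / logged_probs) * log_target_probs)
    iml = np.mean(log_target_probs)
    return pil + lam * iml

# Toy usage on synthetic logged data (all quantities are made up for illustration).
rng = np.random.default_rng(0)
n = 1000
rewards = rng.binomial(1, 0.3, size=n).astype(float)  # nonnegative rewards
logged_probs = rng.uniform(0.05, 0.5, size=n)         # logging propensities p_i
target_probs = rng.uniform(0.05, 0.5, size=n)         # pi(a_i | x_i) under a candidate policy
print(ipwe(rewards, logged_probs, target_probs))
print(clipped_ipwe(rewards, logged_probs, target_probs, clip=5.0))
print(pil_iml_objective(np.log(target_probs), rewards, logged_probs, lam=0.5))
```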

[1] J. Robins et al. Semiparametric Efficiency in Multivariate Regression Models with Missing Data, 1995.

[2] P. Austin. An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies, 2011, Multivariate Behavioral Research.

[3] Alexandros Karatzoglou et al. Session-based Recommendations with Recurrent Neural Networks, 2015, ICLR.

[4] Marc G. Bellemare et al. Safe and Efficient Off-Policy Reinforcement Learning, 2016, NIPS.

[5] John Langford et al. Doubly Robust Policy Evaluation and Learning, 2011, ICML.

[6] S. Julious et al. Confounding and Simpson's paradox, 1994, BMJ.

[7] Charles J. Geyer. 5601 Notes: The Subsampling Bootstrap, 2002.

[8] Joseph P. Romano et al. Large Sample Confidence Regions Based on Subsamples under Minimal Assumptions, 1994.

[9] Joseph Kang et al. Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data, 2007, arXiv:0804.2958.

[10] Martin Wattenberg et al. Ad click prediction: a view from the trenches, 2013, KDD.

[11] E. H. Simpson. The Interpretation of Interaction in Contingency Tables, 1951.

[12] Sham M. Kakade. A Natural Policy Gradient, 2001, NIPS.

[13] D. Horvitz et al. A Generalization of Sampling Without Replacement from a Finite Universe, 1952.

[14] Mark J. van der Laan et al. Data-adaptive selection of the truncation level for Inverse-Probability-of-Treatment-Weighted estimators, 2008.

[15] Yishay Mansour et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.

[16] Joaquin Quiñonero Candela et al. Counterfactual reasoning and learning systems: the example of computational advertising, 2013, J. Mach. Learn. Res.

[17] Thorsten Joachims et al. Counterfactual Risk Minimization: Learning from Logged Bandit Feedback, 2015, ICML.

[18] M. de Rijke et al. Deep Learning with Logged Bandit Feedback, 2018, ICLR.

[19] J. Robins et al. Estimation of Regression Coefficients When Some Regressors are not Always Observed, 1994.

[20] Sergey Levine et al. Reinforcement Learning with Deep Energy-Based Policies, 2017, ICML.

[21] J. Robins et al. Doubly Robust Estimation in Missing Data and Causal Inference Models, 2005, Biometrics.

[22] Lihong Li et al. Learning from Logged Implicit Exploration Data, 2010, NIPS.

[23] Sergey Levine et al. Trust Region Policy Optimization, 2015, ICML.

[24] M. de Rijke et al. Large-scale Validation of Counterfactual Learning Methods: A Test-Bed, 2016, arXiv.

[25] Nan Jiang et al. Doubly Robust Off-policy Value Evaluation for Reinforcement Learning, 2015, ICML.

[26] John Shawe-Taylor et al. Generalization Performance of Support Vector Machines and Other Pattern Classifiers, 1999.

[27] Alec Radford et al. Proximal Policy Optimization Algorithms, 2017, arXiv.

[28] Steffen Rendle et al. Factorization Machines with libFM, 2012, TIST.

[29] Sébastien Bubeck et al. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, 2012, Found. Trends Mach. Learn.

[30] Chih-Jen Lin et al. Field-aware Factorization Machines for CTR Prediction, 2016, RecSys.

[31] Yishay Mansour et al. Learning Bounds for Importance Weighting, 2010, NIPS.

[32] Philip S. Thomas et al. Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning, 2016, ICML.