Optimization Approaches for Counterfactual Risk Minimization with Continuous Actions

Counterfactual reasoning from logged data has become increasingly important for a broad range of applications such as web advertising and healthcare. In this paper, we address the problem of counterfactual risk minimization for learning a stochastic policy with a continuous action space. Whereas prior work has mostly focused on deriving statistical estimators based on importance sampling, we show that the optimization perspective is equally important for solving the resulting nonconvex optimization problems. Specifically, we demonstrate the benefits of proximal point algorithms and of soft-clipping estimators, which are more amenable to gradient-based optimization than classical hard clipping. We propose multiple synthetic, yet realistic, evaluation setups, and we release a new large-scale dataset based on web advertising data for this problem, which crucially lacks public benchmarks.
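To make the hard- versus soft-clipping distinction concrete, here is a minimal sketch of a clipped inverse-propensity-score (IPS) estimator. The function name `ips_estimate` and the smooth surrogate `M * tanh(w / M)` are illustrative assumptions: hard clipping replaces the importance weight w with min(w, M), which is non-differentiable at w = M, while a smooth surrogate keeps the objective differentiable everywhere; the paper's exact soft-clipping function may differ.

```python
# Illustrative sketch (not the paper's exact estimator): hard- vs. soft-clipped
# importance sampling for off-policy evaluation from logged bandit feedback.
import numpy as np

def ips_estimate(rewards, weights, clip=None, soft=False):
    """Clipped IPS estimate of a policy's expected reward.

    rewards: logged rewards y_i
    weights: importance weights pi(a_i | x_i) / pi_0(a_i | x_i)
    clip:    clipping threshold M (None = no clipping)
    soft:    if True, use the smooth surrogate M * tanh(w / M)
             instead of the hard clip min(w, M)
    """
    w = np.asarray(weights, dtype=float)
    if clip is not None:
        # Hard clipping caps each weight at M; the tanh surrogate approaches M
        # smoothly, so gradients with respect to the policy remain well behaved.
        w = clip * np.tanh(w / clip) if soft else np.minimum(w, clip)
    return float(np.mean(w * np.asarray(rewards, dtype=float)))

# Toy usage: heavy-tailed (log-normal) weights make plain IPS high-variance;
# both clipped variants trade a little bias for much lower variance.
rng = np.random.default_rng(0)
weights = np.exp(rng.normal(0.0, 1.5, size=10_000))
rewards = rng.uniform(0.0, 1.0, size=10_000)
print(ips_estimate(rewards, weights))                        # plain IPS
print(ips_estimate(rewards, weights, clip=10.0))             # hard clip
print(ips_estimate(rewards, weights, clip=10.0, soft=True))  # smooth clip
```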
