Cost-Effective Incentive Allocation via Structured Counterfactual Inference

We address a practical problem ubiquitous in modern marketing campaigns, in which a central agent learns a policy for allocating strategic financial incentives to customers while observing only bandit feedback. In contrast to traditional policy optimization frameworks, we account for the additional reward structure and budget constraints common in this setting, and develop a new two-step method for solving this constrained counterfactual policy optimization problem. Our method first casts reward estimation as a domain adaptation problem with supplementary structure, and then uses the resulting estimators to optimize the policy under constraints. We also establish theoretical error bounds for our estimation procedure and show empirically that the approach leads to significant improvement on both synthetic and real datasets.
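To make the second step concrete, the following is a minimal sketch of budget-constrained allocation once reward estimates are available. It is an illustration under stated assumptions, not the authors' algorithm: the function name `allocate_with_budget`, the per-customer average-cost form of the budget, and the Lagrangian bisection are all hypothetical choices made here for exposition, and the estimated reward matrix `r_hat` is assumed to come from some upstream estimator.

```python
import numpy as np


def allocate_with_budget(r_hat, costs, budget, iters=50):
    """Choose one incentive level per customer under an average-cost budget.

    r_hat  : (n, k) array of estimated rewards (assumed given by step one).
    costs  : (k,) array with the cost of each incentive level.
    budget : maximum allowed mean cost per customer (an assumed constraint form).
    Returns the chosen action indices and the Lagrange multiplier used.
    """
    r_hat = np.asarray(r_hat, dtype=float)
    costs = np.asarray(costs, dtype=float)

    if costs.min() > budget:
        raise ValueError("even the cheapest incentive exceeds the budget")

    def choose(lam):
        # Greedy choice for the penalized objective r_hat - lam * cost.
        return np.argmax(r_hat - lam * costs, axis=1)

    # If the unconstrained greedy policy is already affordable, keep it.
    actions = choose(0.0)
    if costs[actions].mean() <= budget:
        return actions, 0.0

    # Grow the multiplier until the penalized policy fits the budget.
    lo, hi = 0.0, 1.0
    while costs[choose(hi)].mean() > budget:
        lo, hi = hi, hi * 2.0
        if hi > 1e12:
            raise ValueError("could not find a feasible multiplier")

    # Bisect, keeping `hi` feasible and `lo` infeasible throughout.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if costs[choose(mid)].mean() > budget:
            lo = mid
        else:
            hi = mid
    return choose(hi), hi


if __name__ == "__main__":
    # Hypothetical data: 2 customers, 3 incentive levels costing 0, 1, 2.
    r_hat = [[0.0, 1.0, 2.0], [0.0, 0.1, 0.2]]
    actions, lam = allocate_with_budget(r_hat, costs=[0.0, 1.0, 2.0], budget=1.0)
    print(actions, lam)
```

The sketch returns a feasible allocation rather than a provably optimal one; with discrete incentive levels the Lagrangian relaxation can leave part of the budget unspent.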
