Distributionally Robust Policy Evaluation and Learning in Offline Contextual Bandits

Policy learning from historical observational data is an important problem with widespread applications. However, the existing literature rests on the crucial assumption that the future environment in which the learned policy will be deployed is the same as the past environment that generated the data, an assumption that is often false or too coarse an approximation. In this paper, we lift this assumption and aim to learn a distributionally robust policy from bandit observational data. We propose a novel learning algorithm that produces policies robust to adversarial perturbations and unknown covariate shifts. We first present a policy evaluation procedure for the ambiguous environment, and then give a heuristic algorithm that solves the distributionally robust policy learning problem efficiently. Finally, we provide extensive simulations to demonstrate the robustness of our policy.
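To make the evaluation step concrete, the sketch below computes a worst-case policy value over a Kullback-Leibler ambiguity ball around the importance-weighted empirical reward distribution, using the standard dual form of the KL-constrained inner minimization. This is a minimal illustration, not the paper's exact procedure: the KL radius `delta`, the function name `robust_policy_value`, and the grid search over the dual variable are all assumptions made for the example.

```python
import numpy as np

def robust_policy_value(rewards, weights, delta, alphas=None):
    """Worst-case policy value over a KL ball of radius delta.

    Approximates  inf_{Q : KL(Q || P) <= delta} E_Q[reward], where P is the
    importance-weighted empirical reward distribution, via its dual form:
        sup_{alpha > 0}  -alpha * log E_P[exp(-r / alpha)] - alpha * delta.
    A simple grid search over alpha keeps the sketch dependency-free.
    """
    r = np.asarray(rewards, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                       # normalize importance weights
    if alphas is None:
        alphas = np.logspace(-3, 3, 600)  # grid for the dual variable

    best = -np.inf
    for alpha in alphas:
        a = np.log(w) - r / alpha         # log of w_i * exp(-r_i / alpha)
        m = a.max()                       # numerically stable log-sum-exp
        lse = m + np.log(np.exp(a - m).sum())
        best = max(best, -alpha * lse - alpha * delta)
    return best
```

With `delta = 0` the worst case recovers the ordinary importance-weighted value estimate; growing `delta` enlarges the ambiguity set and can only decrease the robust value, which is the monotonicity a distributionally robust evaluator should exhibit.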
