Combining Offline Causal Inference and Online Bandit Learning for Data Driven Decisions

A fundamental question for companies with large amounts of logged data is: how can such logged data be used, together with incoming streaming data, to make good decisions? Many companies currently make decisions via online A/B tests, but wrong decisions during testing hurt users' experiences and can cause irreversible damage. A typical alternative is offline causal inference, which analyzes logged data alone to make decisions; however, these decisions are not adaptive to new incoming data, so a wrong decision continues to hurt users' experiences. To overcome these limitations, we propose a framework that unifies offline causal inference algorithms (e.g., weighting, matching) with online learning algorithms (e.g., UCB, LinUCB). We propose novel algorithms and bound their decision accuracy via the notion of regret, deriving the first regret upper bound for forest-based online bandit algorithms. Experiments on two real datasets show that our algorithms outperform baselines that use only logged data or only online feedback, as well as baselines that do not use the data properly.
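To make the combination concrete, below is a minimal sketch of one way offline causal inference can seed an online bandit: the arm statistics of UCB1 are initialized with inverse-propensity-weighted (IPW) estimates computed from logged data, then refined with online feedback. This is an illustration under simplifying assumptions (known logging propensities, raw logged pull counts reused as prior counts), not the paper's exact algorithm; the names ipw_estimates and WarmStartUCB are hypothetical.

import numpy as np

def ipw_estimates(actions, rewards, propensities, n_arms):
    """Per-arm IPW mean-reward estimates and raw logged pull counts."""
    means = np.zeros(n_arms)
    counts = np.zeros(n_arms)
    for a, r, p in zip(actions, rewards, propensities):
        means[a] += r / p      # inverse-propensity-weighted reward
        counts[a] += 1
    means /= len(actions)      # standard IPW normalization by total log size
    return means, counts

class WarmStartUCB:
    """UCB1 whose statistics are seeded with offline estimates."""
    def __init__(self, prior_means, prior_counts):
        self.means = np.asarray(prior_means, dtype=float).copy()
        self.counts = np.asarray(prior_counts, dtype=float).copy()
        self.t = float(self.counts.sum())

    def select(self):
        self.t += 1
        # Confidence bonus shrinks for arms with many (logged or online) pulls.
        bonus = np.sqrt(2.0 * np.log(self.t) / np.maximum(self.counts, 1e-9))
        return int(np.argmax(self.means + bonus))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

# Toy run: logged data from a skewed logging policy, then online UCB.
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])      # unknown to the learner
log_policy = np.array([0.6, 0.3, 0.1])      # known logging propensities
actions = rng.choice(3, size=5000, p=log_policy)
rewards = rng.binomial(1, true_means[actions])

means0, counts0 = ipw_estimates(actions, rewards, log_policy[actions], 3)
agent = WarmStartUCB(means0, counts0)
for _ in range(1000):
    arm = agent.select()
    agent.update(arm, rng.binomial(1, true_means[arm]))

One design point this sketch highlights: seeding the counts shrinks the confidence bonus on arms the logged data covers well, so early online exploration concentrates on under-logged arms rather than re-exploring everything from scratch.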
