Counterfactual Data-Fusion for Online Reinforcement Learners

Online learning agents can learn both from their own experimentation and from the observed behaviors of other agents interacting with the environment. However, data collected through these different modalities cannot be naively combined, because the decision-making context changes across them, including factors that may be unobserved. The data-fusion problem addresses how information collected under such disparate conditions (observationally, experimentally, and counterfactually) can be combined to yield more informative results than the independent datasets alone. The present work provides a recipe for combining multiple datasets to accelerate learning in a variant of the Multi-Armed Bandit problem with Unobserved Confounders (MABUC). We demonstrate this data-fusion approach through an enhanced Thompson Sampling bandit player, and support its efficacy with extensive simulations.
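To make the approach concrete, the following is a minimal sketch (not the paper's implementation) of a counterfactual Thompson Sampling player in the MABUC setting. The agent keeps a Beta posterior per (intent, arm) cell, conditioning each choice on the arm it was naturally inclined to pull, which serves as a proxy for the unobserved confounders; prior datasets are fused in by seeding the posteriors with their success/failure counts. The two-arm environment, the payout table, and the `seed_posteriors` helper are all hypothetical, chosen only for illustration.

```python
import random

random.seed(0)

ARMS, INTENTS = 2, 2

# Beta(successes + 1, failures + 1) posterior counts, one per (intent, arm).
S = [[0] * ARMS for _ in range(INTENTS)]
F = [[0] * ARMS for _ in range(INTENTS)]

def seed_posteriors(counts):
    """Fuse a prior dataset: counts[(i, a)] = (successes, failures).

    Observational rows pair each intent i with the confounded natural
    choice a = i; experimental rows can cover the off-diagonal cells.
    """
    for (i, a), (s, f) in counts.items():
        S[i][a] += s
        F[i][a] += f

# Hypothetical payout table P(win | intent, arm): the dependence of the
# payout on the intent stands in for the unobserved confounders.
PAYOUT = [[0.2, 0.7],
          [0.6, 0.3]]

def play(rounds):
    total = 0.0
    for _ in range(rounds):
        i = random.randrange(INTENTS)        # confounded natural intent
        # Thompson step: sample a payout estimate per arm, given intent i.
        samples = [random.betavariate(S[i][a] + 1, F[i][a] + 1)
                   for a in range(ARMS)]
        a = samples.index(max(samples))
        r = 1 if random.random() < PAYOUT[i][a] else 0
        S[i][a] += r
        F[i][a] += 1 - r
        total += r
    return total / rounds

# Seed with a small (hypothetical) observational dataset, then play.
seed_posteriors({(0, 0): (4, 16), (1, 1): (6, 14)})
avg = play(5000)
```

Because the posteriors are indexed by intent, the agent learns that arm 1 is best when its intuition says 0 and vice versa, which a standard (intent-blind) Thompson Sampling player conflates; the seeded counts simply give the learner a head start on the cells the prior data covers.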
