Optimal Fixed-Budget Best Arm Identification using the Augmented Inverse Probability Estimator in Two-Armed Gaussian Bandits with Unknown Variances

We consider the fixed-budget best arm identification problem in two-armed Gaussian bandits with unknown variances. When the variances are unknown and the algorithm is agnostic to the optimal proportion of arm draws, the tightest lower bound on the complexity and an algorithm whose performance guarantee matches that lower bound have long been open problems. In this paper, we propose a strategy that combines a sampling rule with randomized sampling (RS) following the estimated target allocation probabilities of arm draws, and a recommendation rule based on the augmented inverse probability weighting (AIPW) estimator, which is widely used in the causal inference literature. We refer to this strategy as the RS-AIPW strategy. In the theoretical analysis, we first derive a large deviation principle for martingales that applies when the second moment converges in mean, and we apply it to the proposed strategy. We then show that the proposed strategy is asymptotically optimal in the sense that its probability of misidentification achieves the lower bound of Kaufmann et al. (2016) as the sample size grows to infinity and the gap between the two arms goes to zero.
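As a concrete illustration of the strategy described above, the sketch below simulates the RS-AIPW idea on a two-armed Gaussian bandit: at each round the arm-draw probabilities are set from the estimated standard deviations (a Neyman-type allocation), an arm is drawn at random from that allocation, and at the end of the budget each arm's mean is estimated with the AIPW estimator and the arm with the larger estimate is recommended. The clipping constants, initial values, and plug-in estimators are illustrative assumptions rather than the paper's exact specification, and the function name rs_aipw and its interface are hypothetical.

import numpy as np

def rs_aipw(mu, sigma, T, seed=0):
    # Simulate the RS-AIPW idea on a two-armed Gaussian bandit with true means
    # `mu` and standard deviations `sigma` (known only to the simulator).
    rng = np.random.default_rng(seed)
    rewards = [[], []]            # observed rewards per arm
    scores = np.zeros(2)          # running sums of per-round AIPW scores

    for t in range(T):
        # Plug-in estimates based on past observations only (assumed initial values).
        sd = np.array([np.std(r) if len(r) >= 2 else 1.0 for r in rewards])
        mhat = np.array([np.mean(r) if r else 0.0 for r in rewards])

        # Estimated target allocation (Neyman-type: proportional to the standard
        # deviations), clipped away from 0 and 1 so the AIPW weights stay bounded.
        p1 = float(np.clip(sd[0] / (sd[0] + sd[1]), 0.05, 0.95))
        p = np.array([p1, 1.0 - p1])

        # Randomized sampling (RS): draw an arm from the estimated allocation.
        a = rng.choice(2, p=p)
        y = rng.normal(mu[a], sigma[a])

        # Per-round AIPW score for each arm: inverse-probability-weighted residual
        # for the drawn arm plus the plug-in mean for both arms.
        for k in range(2):
            scores[k] += float(a == k) * (y - mhat[k]) / p[k] + mhat[k]

        rewards[a].append(y)

    aipw = scores / T             # AIPW estimates of the two arm means
    return int(np.argmax(aipw))   # recommend the arm with the larger estimate

# Example run: arm 0 has the larger mean and should be recommended with high probability.
print(rs_aipw(mu=[1.0, 0.8], sigma=[1.0, 2.0], T=10_000))

Because the plug-in means and allocation probabilities at each round use only past observations, the centered per-round AIPW scores form a martingale difference sequence, which is the structure the martingale large deviation analysis in the paper exploits.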

[1] Daniel Russo, et al. Simple Bayesian Algorithms for Best Arm Identification, 2016, COLT.

[2] Aurélien Garivier, et al. Optimal Best Arm Identification with Fixed Confidence, 2016, COLT.

[3] John N. Tsitsiklis, et al. The Sample Complexity of Exploration in the Multi-Armed Bandit Problem, 2004, J. Mach. Learn. Res.

[4] D. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies, 1974.

[5] Annie Liang, et al. Dynamically Aggregating Diverse Information, 2019, EC.

[6] Dean Karlan, et al. Adaptive Experimental Design Using the Propensity Score, 2009.

[7] Peter W. Glynn, et al. A large deviations perspective on ordinal optimization, 2004, Proceedings of the 2004 Winter Simulation Conference.

[8] J. Robins, et al. Double/Debiased Machine Learning for Treatment and Structural Parameters, 2017.

[9] Junpei Komiyama, et al. Policy Choice and Best Arm Identification: Asymptotic Analysis of Exploration Sampling, 2021, arXiv:2109.08229.

[10] Alessandro Lazaric, et al. Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence, 2012, NIPS.

[11] David Sontag, et al. Counterfactual Off-Policy Evaluation with Gumbel-Max Structural Causal Models, 2019, ICML.

[12] Susan Athey, et al. Off-Policy Evaluation via Adaptive Weighting with Data from Contextual Bandits, 2021, KDD.

[13] Michal Valko, et al. Fixed-Confidence Guarantees for Bayesian Best-Arm Identification, 2019, AISTATS.

[14] Keisuke Hirano, et al. Asymptotic analysis of statistical decision rules in econometrics, 2020.

[15] R. Ellis, et al. Large Deviations for a General Class of Random Vectors, 1984.

[16] Xiequan Fan, et al. Cramér large deviation expansions for martingales under Bernstein's condition, 2012, arXiv:1210.2198.

[17] John Langford, et al. Doubly Robust Policy Evaluation and Learning, 2011, ICML.

[18] M. J. van der Laan. Statistical Inference for the Mean Outcome Under a Possibly Non-Unique Optimal Treatment Strategy, 2016, Annals of Statistics.

[19] Shie Mannor, et al. Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems, 2006, J. Mach. Learn. Res.

[20] Zhengyuan Zhou, et al. Online Multi-Armed Bandits with Adaptive Inference, 2021, NeurIPS.

[21] Dominik D. Freydenberger, et al. Can We Learn to Gamble Efficiently?, 2010, COLT.

[22] J. Honda, et al. Adaptive Experimental Design for Efficient Treatment Effect Estimation: Randomized Allocation via Contextual Bandit Algorithm, 2020, arXiv.

[23] Diego Klabjan, et al. Improving the Expected Improvement Algorithm, 2017, NIPS.

[24] Matthew Malloy, et al. lil' UCB: An Optimal Exploration Algorithm for Multi-Armed Bandits, 2013, COLT.

[25] Nando de Freitas, et al. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning, 2014, AISTATS.

[26] Oren Somekh, et al. Almost Optimal Exploration in Multi-Armed Bandits, 2013, ICML.

[27] Stefan Wager, et al. Confidence intervals for policy evaluation in adaptive experiments, 2021, Proceedings of the National Academy of Sciences.

[28] Ion Grama, et al. Large deviations for martingales via Cramér's method, 2000.

[29] Wouter M. Koolen, et al. Non-Asymptotic Pure Exploration by Solving Games, 2019, NeurIPS.

[30] G. Imbens, et al. Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score, 2000.

[31] Ambuj Tewari, et al. PAC Subset Selection in Stochastic Multi-armed Bandits, 2012, ICML.

[32] Chun-Hung Chen, et al. Simulation Budget Allocation for Further Enhancing the Efficiency of Ordinal Optimization, 2000, Discret. Event Dyn. Syst.

[33] Antoine Chambaz, et al. Post-Contextual-Bandit Inference, 2021, NeurIPS.

[34] Jonathan D. Cryer. Time Series Analysis, 1986.

[35] Miroslav Dudík, et al. Optimal and Adaptive Off-policy Evaluation in Contextual Bandits, 2016, ICML.

[36] J. Gärtner. On Large Deviations from the Invariant Measure, 1977.

[37] H. Robbins. Some aspects of the sequential design of experiments, 1952.

[38] Maximilian Kasy, et al. Adaptive Treatment Assignment in Experiments for Policy Choice, 2019, Econometrica.

[39] Lalit Jain, et al. An Empirical Process Approach to the Union Bound: Practical Algorithms for Combinatorial and Linear Bandits, 2020, NeurIPS.

[40] Aurélien Garivier, et al. On the Complexity of A/B Testing, 2014, COLT.

[41] Nikos Vlassis, et al. More Efficient Off-Policy Evaluation through Regularized Targeted Learning, 2019, ICML.

[42] K. Hirano, et al. Asymptotics for Statistical Treatment Rules, 2009.

[43] Robert E. Bechhofer. Sequential identification and ranking procedures: with special reference to Koopman-Darmois populations, 1970.

[44] M. J. van der Laan. The Construction and Analysis of Adaptive Group Sequential Designs, 2008.

[45] David Childers, et al. Efficient Online Estimation of Causal Effects by Deciding What to Observe, 2021, NeurIPS.

[46] Alexandra Carpentier, et al. Tight (Lower) Bounds for the Fixed Budget Best Arm Identification Bandit Problem, 2016, COLT.

[47] I. Johnstone, et al. Asymptotically Optimal Procedures for Sequential Adaptive Selection of the Best of Several Normal Means, 1982.

[48] J. Hahn. On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average Treatment Effects, 1998.

[49] Rémi Munos, et al. Pure exploration in finitely-armed and continuous-armed bandits, 2011, Theor. Comput. Sci.

[50] Shota Yasui, et al. Efficient Counterfactual Learning from Bandit Feedback, 2018, AAAI.

[51] Wonyoung Kim, et al. Doubly Robust Thompson Sampling for Linear Payoffs, 2021, arXiv.

[52] W. R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, 1933.

[53] M. J. van der Laan. Online Targeted Learning, 2014.

[54] Aurélien Garivier, et al. On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models, 2014, J. Mach. Learn. Res.

[55] T. L. Lai and H. Robbins. Asymptotically Efficient Adaptive Allocation Rules, 1985.

[56] Junpei Komiyama, et al. Optimal Simple Regret in Bayesian Best Arm Identification, 2021.

[57] A. Zeevi, et al. Online Ordinal Optimization under Model Misspecification, 2021.

[58] Max Tabord-Meehan. Stratification Trees for Adaptive Randomization in Randomized Controlled Trials, 2018, The Review of Economic Studies.