Policy Choice and Best Arm Identification: Comments on "Adaptive Treatment Assignment in Experiments for Policy Choice"

Adaptive experimental design for efficient decision-making is an important problem in economics. The purpose of this paper is to connect the “policy choice” problem, proposed in Kasy and Sautmann (2021) as an instance of adaptive experimental design, to the frontier of the bandit literature in machine learning. We show that the policy choice problem can be framed so that it is identical to what the machine learning literature calls the “best arm identification” (BAI) problem. This connection reveals that the asymptotic optimality of the policy choice algorithms considered in Kasy and Sautmann (2021) is a long-standing open question in the bandit literature. While Kasy and Sautmann (2021) presents an interesting and important empirical study, the connection unfortunately exposes several major issues with its theoretical results. In particular, we show that Theorem 1 in Kasy and Sautmann (2021) is false. The proofs of statements (1) and (2) of the theorem are incorrect; although the statements themselves may be true, the proofs are non-trivial to repair. Statement (3), by contrast, is itself false, as we show using existing lower-bound results from the bandit literature. Because these questions have attracted sustained interest in the bandit community over the past decade, we provide a review of recent developments in the BAI literature. We hope this review highlights the relevance of BAI to economic problems and stimulates methodological and theoretical developments in the econometric community.
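
To fix ideas, here is a minimal sketch of the fixed-confidence BAI formulation, in our own notation, following [4] and [26]. A $K$-armed bandit model $\nu = (\nu_1, \dots, \nu_K)$ has a unique best arm $a^*(\nu)$; a $\delta$-correct strategy samples arms adaptively, stops at a time $\tau_\delta$, and returns a guess $\hat{a}$ with $\mathbb{P}_\nu(\hat{a} \neq a^*(\nu)) \leq \delta$ on every instance. Any such strategy obeys
\[
  \mathbb{E}_\nu[\tau_\delta] \;\geq\; T^*(\nu)\,\mathrm{kl}(\delta, 1-\delta),
  \qquad
  \frac{1}{T^*(\nu)} \;=\; \sup_{w \in \Sigma_K} \, \inf_{\lambda \in \mathrm{Alt}(\nu)} \, \sum_{a=1}^{K} w_a \, \mathrm{KL}(\nu_a, \lambda_a),
\]
where $\Sigma_K$ is the simplex of sampling proportions, $\mathrm{Alt}(\nu)$ is the set of models whose best arm differs from $a^*(\nu)$, $\mathrm{KL}$ is the relative entropy between arm distributions, and $\mathrm{kl}$ is its Bernoulli specialization. An algorithm is asymptotically optimal when its expected stopping time matches $T^*(\nu)\log(1/\delta)$ as $\delta \to 0$; whether Thompson-sampling-type rules of the sort studied in Kasy and Sautmann (2021) attain such benchmarks is precisely the open question mentioned above. In the fixed-budget counterpart, which is closer to the policy choice setting, the experimenter has $T$ rounds and is judged on the misidentification probability $\mathbb{P}(\hat{a}_T \neq a^*)$; lower bounds of this type (see [22, 26, 47]) cap how fast that probability can decay, and results of this kind drive the argument against statement (3).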

[1] Michal Valko et al. Fixed-Confidence Guarantees for Bayesian Best-Arm Identification. AISTATS, 2019.

[2] R. Khan et al. Sequential Tests of Statistical Hypotheses. 1972.

[3] T. L. Lai and Herbert Robbins. Asymptotically Efficient Adaptive Allocation Rules. Advances in Applied Mathematics, 1985.

[4] Aurélien Garivier et al. Optimal Best Arm Identification with Fixed Confidence. COLT, 2016.

[5] Masashi Sugiyama et al. Polynomial-Time Algorithms for Multiple-Arm Identification with Full-Bandit Feedback. Neural Computation, 2019.

[6] Wei Chen et al. Combinatorial Pure Exploration of Multi-Armed Bandits. NIPS, 2014.

[7] Chun-Hung Chen et al. Simulation Budget Allocation for Further Enhancing the Efficiency of Ordinal Optimization. Discrete Event Dynamic Systems, 2000.

[8] Masashi Sugiyama et al. Fully adaptive algorithm for pure exploration in linear bandits. arXiv:1710.05552, 2017.

[9] Andrew W. Moore et al. The Racing Algorithm: Model Selection for Lazy Learners. Artificial Intelligence Review, 1997.

[10] Robert D. Nowak et al. Top Arm Identification in Multi-Armed Bandits with Batch Arm Pulls. AISTATS, 2016.

[11] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 1952.

[12] Oren Somekh et al. Almost Optimal Exploration in Multi-Armed Bandits. ICML, 2013.

[13] I. Johnstone et al. Asymptotically Optimal Procedures for Sequential Adaptive Selection of the Best of Several Normal Means. 1982.

[14] Rémi Munos et al. Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 2011.

[15] John N. Tsitsiklis et al. The Sample Complexity of Exploration in the Multi-Armed Bandit Problem. Journal of Machine Learning Research, 2004.

[16] Daniel Russo. Simple Bayesian Algorithms for Best Arm Identification. COLT, 2016.

[17] Alexandre Proutiere et al. Optimal Best-arm Identification in Linear Bandits. NeurIPS, 2020.

[18] Dominik D. Freydenberger et al. Can We Learn to Gamble Efficiently? COLT, 2010.

[19] Peter W. Glynn et al. A large deviations perspective on ordinal optimization. Proceedings of the 2004 Winter Simulation Conference, 2004.

[20] Ilya O. Ryzhov. On the Convergence Rates of Expected Improvement Methods. Operations Research, 2016.

[21] Csaba Szepesvári et al. Structured Best Arm Identification with Fixed Confidence. ALT, 2017.

[22] Alessandro Lazaric et al. Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence. NIPS, 2012.

[23] Steven L. Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 2010.

[24] Christian Igel et al. Hoeffding and Bernstein races for selecting policies in evolutionary direct policy search. ICML, 2009.

[25] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933.

[26] Aurélien Garivier et al. On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models. Journal of Machine Learning Research, 2014.

[27] Rémi Munos et al. Stochastic Simultaneous Optimistic Optimization. ICML, 2013.

[28] Alexandros G. Dimakis et al. Identifying Best Interventions through Online Importance Sampling. ICML, 2017.

[29] Wouter M. Koolen et al. Monte-Carlo Tree Search by Best Arm Identification. NIPS, 2017.

[30] Maximilian Kasy and Anja Sautmann. Adaptive Treatment Assignment in Experiments for Policy Choice. Econometrica, 2021.

[31] Shivaram Kalyanakrishnan et al. Information Complexity in Bandit Subset Selection. COLT, 2013.

[32] Diego Klabjan et al. Improving the Expected Improvement Algorithm. NIPS, 2017.

[33] Matthew Malloy et al. lil' UCB: An Optimal Exploration Algorithm for Multi-Armed Bandits. COLT, 2013.

[34] Sébastien Bubeck et al. Multiple Identifications in Multi-Armed Bandits. ICML, 2012.

[35] Robert E. Bechhofer et al. Sequential identification and ranking procedures: with special reference to Koopman-Darmois populations. 1970.

[36] E. Paulson. A Sequential Procedure for Selecting the Population with the Largest Mean from $k$ Normal Populations. The Annals of Mathematical Statistics, 1964.

[37] Yuan Zhou et al. Best Arm Identification in Linear Bandits with Linear Dimension Dependency. ICML, 2018.

[38] Shipra Agrawal et al. Analysis of Thompson Sampling for the Multi-armed Bandit Problem. COLT, 2011.

[39] Shie Mannor et al. Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems. Journal of Machine Learning Research, 2006.

[40] Walter T. Federer et al. Sequential Design of Experiments. 1967.

[41] Alessandro Lazaric et al. Best-Arm Identification in Linear Bandits. NIPS, 2014.

[42] Ameet Talwalkar et al. Non-stochastic Best Arm Identification and Hyperparameter Optimization. AISTATS, 2015.

[43] Wouter M. Koolen et al. Non-Asymptotic Pure Exploration by Solving Games. NeurIPS, 2019.

[44] Lalit Jain et al. Sequential Experimental Design for Transductive Linear Bandits. NeurIPS, 2019.

[45] Ambuj Tewari et al. PAC Subset Selection in Stochastic Multi-armed Bandits. ICML, 2012.

[46] Peter L. Bartlett et al. Best of both worlds: Stochastic & adversarial best-arm identification. COLT, 2018.

[47] Alexandra Carpentier et al. Tight (Lower) Bounds for the Fixed Budget Best Arm Identification Bandit Problem. COLT, 2016.