The Finite-Horizon Two-Armed Bandit Problem with Binary Responses: A Multidisciplinary Survey of the History, State of the Art, and Myths

In this paper we consider the two-armed bandit problem, which often appears naturally in its own right or as a subproblem in multi-armed generalizations, and which serves as a starting point for introducing additional problem features. The focus on binary responses is motivated by their widespread applicability and by this being one of the most studied settings. We concentrate on the undiscounted finite-horizon objective, which is the most relevant in many applications. We attempt to unify the terminology, which differs across the disciplines that have considered this problem, and present a unified model cast in the Markov decision process framework, with subject responses modelled by the Bernoulli distribution and the conjugate Beta distribution used for Bayesian updating. We give an extensive account of the history and state of the art of approaches from several disciplines, including the design of experiments, Bayesian decision theory, naive designs, reinforcement learning, biostatistics, and combination designs. We evaluate these designs, together with a few newly proposed ones, by exact computation (using a new package written by the author in the Julia programming language) in order to compare their performance. We show that the conclusions for moderate horizons (typical in practice) differ from those for small horizons (typical in the academic literature reporting computational results). We further list and clarify a number of myths about this problem; for example, we show that, computationally, much larger problems can be solved to Bayes-optimality than is commonly believed.
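
To make the model concrete, the following is a minimal sketch (not the author's Julia package, whose interface is not shown here) of the Beta-Bernoulli Markov decision process the abstract describes: independent Beta(1,1) priors on the two arms' success probabilities, Bayesian updating of the posterior after each binary response, and backward induction (via memoised recursion) to the Bayes-optimal expected number of successes over a horizon of T subjects. The function name `bayes_optimal_value` and the choice of uniform priors are illustrative assumptions.

```julia
# Sketch of Bayes-optimal dynamic programming for the finite-horizon
# two-armed Bernoulli bandit; assumes independent Beta(1,1) priors and
# the undiscounted objective of maximising expected successes over T subjects.
function bayes_optimal_value(T::Int)
    # Memoised value function over posterior states (s1, f1, s2, f2),
    # where s_i / f_i count successes / failures observed on arm i.
    memo = Dict{NTuple{4,Int},Float64}()
    function V(s1, f1, s2, f2)
        t = s1 + f1 + s2 + f2
        t == T && return 0.0                      # horizon reached, no value to go
        get!(memo, (s1, f1, s2, f2)) do
            # Posterior mean success probability of arm i under Beta(1 + s_i, 1 + f_i).
            p1 = (1 + s1) / (2 + s1 + f1)
            p2 = (1 + s2) / (2 + s2 + f2)
            # Expected immediate reward plus optimally continued value for each arm.
            q1 = p1 * (1 + V(s1 + 1, f1, s2, f2)) + (1 - p1) * V(s1, f1 + 1, s2, f2)
            q2 = p2 * (1 + V(s1, f1, s2 + 1, f2)) + (1 - p2) * V(s1, f1, s2, f2 + 1)
            max(q1, q2)                           # Bellman optimality over the two arms
        end
    end
    return V(0, 0, 0, 0)
end

# Example: expected successes under the Bayes-optimal design for horizon T = 20.
println(bayes_optimal_value(20))
```

Since the number of reachable posterior states grows only polynomially in T (on the order of T^4 tuples (s1, f1, s2, f2)), this dynamic program is tractable for horizons far beyond toy sizes, which is the computational observation behind the "myth" addressed in the abstract.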
