Bridging the gap between regret minimization and best arm identification, with application to A/B tests

State-of-the-art online learning procedures focus either on selecting the best alternative ("best arm identification") or on minimizing the cost (the "regret"). We merge these two objectives by providing a theoretical analysis of cost-minimizing algorithms that are also delta-PAC (with a proven guaranteed bound on the decision time), hence fulfilling regret minimization and best arm identification at the same time. This analysis sheds light on the common observation that ill-calibrated UCB algorithms minimize regret while still quickly identifying the best arm. We also extend these results to the non-i.i.d. case faced by many practitioners. This provides a technique for trading off cost against decision time when running adaptive tests, with applications ranging from website A/B testing to clinical trials.
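To make the regret-versus-decision-time compromise concrete, here is a minimal sketch in which a UCB sampling rule (regret minimization) is paired with a confidence-interval stopping rule (a delta-PAC-style decision). The arm setup, the exploration constant c, and the particular confidence radius below are illustrative assumptions for a Bernoulli A/B test, not the exact algorithm analyzed in the paper.

```python
import math
import random


def ucb_with_stopping(arms, delta=0.05, c=2.0, horizon=100_000):
    """Pull arms with a UCB index; stop once the empirical leader's lower
    confidence bound dominates every other arm's upper confidence bound
    (a delta-PAC-style stopping rule).

    `arms` is a list of callables, each returning a reward in [0, 1].
    Returns (declared_best_arm, pulls_per_arm, cumulative_reward).
    """
    k = len(arms)
    counts = [0] * k
    sums = [0.0] * k
    total_reward = 0.0

    # Initialize: pull every arm once.
    for i, arm in enumerate(arms):
        r = arm()
        counts[i] += 1
        sums[i] += r
        total_reward += r

    def conf_radius(t):
        # Anytime confidence radius (union-bound style); other calibrations
        # shift the balance between regret and decision time.
        return [math.sqrt(c * math.log(max(t, 2) / delta) / n) for n in counts]

    for t in range(k, horizon):
        means = [s / n for s, n in zip(sums, counts)]
        radius = conf_radius(t)

        # Regret-minimizing choice: play the arm with the highest UCB index.
        i = max(range(k), key=lambda j: means[j] + radius[j])
        r = arms[i]()
        counts[i] += 1
        sums[i] += r
        total_reward += r

        # Best-arm-identification check on the updated statistics.
        means = [s / n for s, n in zip(sums, counts)]
        radius = conf_radius(t + 1)
        leader = max(range(k), key=lambda j: means[j])
        if all(means[leader] - radius[leader] >= means[j] + radius[j]
               for j in range(k) if j != leader):
            return leader, counts, total_reward

    # Horizon reached without separation: report the empirical best arm.
    return max(range(k), key=lambda j: sums[j] / counts[j]), counts, total_reward


if __name__ == "__main__":
    random.seed(0)
    # Two Bernoulli "website variants" with click rates 0.50 and 0.55.
    arms = [lambda: float(random.random() < 0.50),
            lambda: float(random.random() < 0.55)]
    best, pulls, reward = ucb_with_stopping(arms, delta=0.05)
    print(f"declared best arm: {best}, pulls: {pulls}, reward: {reward:.1f}")
```

In this sketch, loosening the confidence radius (a smaller c) speeds up the decision at the price of a larger error probability, while the UCB sampling rule keeps cumulative regret low during the test; this is the kind of compromise the abstract refers to.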
