Information Complexity in Bandit Subset Selection

We consider the problem of efficiently exploring the arms of a stochastic bandit to identify the best subset of a specified size. Under the PAC and the fixed-budget formulations, we derive improved bounds by using KL-divergence-based confidence intervals. Whereas the application of a similar idea in the regret setting has yielded bounds in terms of the KL-divergence between the arms, our bounds in the pure-exploration setting involve the "Chernoff information" between the arms. In addition to introducing this novel quantity to the bandits literature, we contribute a comparison between strategies based on uniform and adaptive sampling for pure-exploration problems, finding evidence in favor of the latter.
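To make the two quantities named above concrete, the following is a minimal sketch that assumes Bernoulli arms (the abstract does not fix a reward distribution). For two Bernoulli means p and q, the Chernoff information can be computed as the common value d(z*, p) = d(z*, q), where d is the Bernoulli KL-divergence and z* is the point between p and q at which the two divergences coincide. A KL-based upper confidence bound is the largest q with n · d(p_hat, q) ≤ beta. The function names and the exploration level `beta` are illustrative assumptions, not the paper's notation or algorithm.

```python
import math


def kl_bernoulli(x, y, eps=1e-12):
    """KL divergence d(x, y) between Bernoulli(x) and Bernoulli(y)."""
    # Clip away from 0 and 1 so the logarithms stay finite.
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))


def chernoff_information(p, q, iters=100):
    """Chernoff information between Bernoulli(p) and Bernoulli(q).

    Bisects for z* in [min(p, q), max(p, q)] with d(z*, p) = d(z*, q);
    d(z, lo) - d(z, hi) is linear and increasing in z, so bisection applies.
    """
    lo, hi = min(p, q), max(p, q)
    a, b = lo, hi
    for _ in range(iters):
        z = (a + b) / 2
        if kl_bernoulli(z, lo) < kl_bernoulli(z, hi):
            a = z  # z is still closer (in KL) to the lower mean
        else:
            b = z
    return kl_bernoulli((a + b) / 2, lo)


def kl_upper_bound(p_hat, n, beta, iters=100):
    """Largest q in [p_hat, 1] such that n * d(p_hat, q) <= beta.

    `beta` is an assumed exploration level supplied by the caller.
    """
    a, b = p_hat, 1.0
    for _ in range(iters):
        q = (a + b) / 2
        if n * kl_bernoulli(p_hat, q) <= beta:
            a = q
        else:
            b = q
    return a
```

For example, `chernoff_information(0.3, 0.7)` equals `kl_bernoulli(0.5, 0.3)` by symmetry, and it is always no larger than either `kl_bernoulli(p, q)` or `kl_bernoulli(q, p)`.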
