Gamification of Pure Exploration for Linear Bandits

We investigate an active pure-exploration setting that includes best-arm identification in linear stochastic bandits. While asymptotically optimal algorithms exist for standard multi-armed bandits, the existence of such algorithms for best-arm identification in linear bandits has remained elusive despite several attempts to address it. First, we provide a thorough comparison of, and new insights into, the different notions of optimality in the linear case, including G-optimality, transductive optimality from optimal experimental design, and asymptotic optimality. Second, we design the first asymptotically optimal algorithm for fixed-confidence pure exploration in linear bandits. As a consequence, our algorithm naturally bypasses the pitfall caused by a simple but difficult instance that most prior algorithms had to be engineered to handle explicitly. Finally, we avoid the need to fully solve an optimal design problem by providing an approach that admits an efficient implementation.
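
For concreteness, asymptotic optimality in the fixed-confidence setting is usually phrased through a characteristic time. A standard formulation for linear bandits with unit-variance Gaussian noise (the notation below is ours: \Delta_{\mathcal{X}} is the simplex over the arm set \mathcal{X}, x^* = x^*(\theta) the best arm, and A(w) the information matrix of a design w) is

T^*(\theta)^{-1} \;=\; \max_{w \in \Delta_{\mathcal{X}}} \;\min_{x \neq x^*} \; \frac{\big(\theta^\top (x^* - x)\big)^2}{2\,\lVert x^* - x \rVert_{A(w)^{-1}}^2}, \qquad A(w) = \sum_{x \in \mathcal{X}} w_x\, x x^\top,

and any \delta-correct strategy satisfies \liminf_{\delta \to 0} \mathbb{E}[\tau_\delta]/\log(1/\delta) \ge T^*(\theta). The max-min structure is a zero-sum game between a player proposing a design w and an adversary proposing the most confusing alternative arm x, which is the game ("gamification") viewpoint the title refers to.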
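
To illustrate what "fully solving an optimal design problem" at every round would involve (and hence what an efficient strategy would rather avoid), here is a minimal Python sketch of the classical Frank-Wolfe / Fedorov-Wynn iteration for computing a G-optimal (equivalently, by the Kiefer-Wolfowitz equivalence theorem, D-optimal) design over a finite arm set. The function name, iteration budget, and tolerance are illustrative choices, not from the paper; the sketch assumes the arms span R^d so the information matrix is invertible.

import numpy as np

def g_optimal_design(X, n_iters=1000, tol=1e-3):
    # Frank-Wolfe iteration maximizing log det A(w) over designs w
    # in the simplex, with A(w) = sum_k w_k x_k x_k^T. By the
    # Kiefer-Wolfowitz equivalence theorem this also minimizes the
    # G-optimality criterion max_x x^T A(w)^{-1} x, whose optimum is d.
    K, d = X.shape
    w = np.full(K, 1.0 / K)              # start from the uniform design
    for _ in range(n_iters):
        A = X.T @ (w[:, None] * X)       # information matrix A(w)
        A_inv = np.linalg.inv(A)
        g = np.einsum('ij,jk,ik->i', X, A_inv, X)  # g_k = x_k^T A^{-1} x_k
        k = int(np.argmax(g))
        if g[k] <= d * (1.0 + tol):      # equivalence-theorem stopping rule
            break
        # closed-form line-search step toward the vertex e_k
        gamma = (g[k] / d - 1.0) / (g[k] - 1.0)
        w = (1.0 - gamma) * w
        w[k] += gamma
    return w

On a spanning arm set this converges to a design with max_x x^T A(w)^{-1} x close to d. The point of the abstract's last claim is that an asymptotically optimal strategy need not run such a solver to optimality at each round; incremental, game-style updates of the design suffice.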
