Pure Exploration Bandit Problem with General Reward Functions Depending on Full Distributions

In this paper, we study the pure exploration bandit model on general distribution functions, meaning that the reward function of each arm depends on its whole distribution rather than only on its mean. We adapt the racing and LUCB frameworks to this problem and design algorithms for estimating the value of the reward function under different types of distributions. We then show that our estimation methods carry correctness guarantees when their parameters are chosen properly, and we derive sample complexity upper bounds for them. Finally, we discuss several important applications and their corresponding solutions under our learning framework.
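To make the setting concrete, the following is a minimal sketch of a racing-style elimination loop in which each arm's reward is a quantile of its distribution rather than its mean. It is an illustration under stated assumptions, not the estimator or algorithm of the paper: the confidence intervals come from the Dvoretzky-Kiefer-Wolfowitz (DKW) bound on the empirical CDF, and the function names (`dkw_radius`, `quantile_interval`, `race`), the choice of the quantile as the reward function, and the per-round confidence schedule are all hypothetical.

```python
import math
import random


def dkw_radius(n, delta):
    """DKW band half-width: sup|F_n - F| <= eps with probability >= 1 - delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def quantile_interval(samples, tau, delta):
    """Confidence interval for the tau-quantile derived from the DKW band.

    Assumption: the reward function is the tau-quantile; other functionals
    of the distribution would need their own interval construction.
    """
    xs = sorted(samples)
    n = len(xs)
    eps = dkw_radius(n, delta)

    def emp_quantile(p):
        # inf{x : F_n(x) >= p}, with infinite values outside (0, 1].
        if p <= 0.0:
            return -math.inf
        if p > 1.0:
            return math.inf
        return xs[math.ceil(p * n) - 1]

    return emp_quantile(tau - eps), emp_quantile(tau + eps)


def race(arms, tau, delta, max_rounds=5000, seed=0):
    """Sample every active arm once per round and eliminate dominated arms.

    `arms` is a list of callables taking a random.Random and returning a draw.
    Returns (index of the surviving arm or None, rounds used).
    """
    rng = random.Random(seed)
    active = list(range(len(arms)))
    data = [[] for _ in arms]
    for t in range(1, max_rounds + 1):
        for i in active:
            data[i].append(arms[i](rng))
        # Crude union bound over arms and rounds (hypothetical schedule):
        # the sum over t of 1 / (2 t^2) is below 1, so total failure <= delta.
        delta_t = delta / (2.0 * len(arms) * t * t)
        bounds = {i: quantile_interval(data[i], tau, delta_t) for i in active}
        best_lower = max(lo for lo, _ in bounds.values())
        # Keep an arm only if its upper bound can still reach the best lower bound.
        active = [i for i in active if bounds[i][1] >= best_lower]
        if len(active) == 1:
            return active[0], t
    return None, max_rounds


if __name__ == "__main__":
    # Two toy arms with disjoint supports; the second has the larger median.
    arms = [lambda rng: rng.random(), lambda rng: 2.0 + rng.random()]
    print(race(arms, tau=0.5, delta=0.05))
```

An LUCB-style variant would differ mainly in the sampling rule: instead of pulling every active arm each round, it would pull only the empirically best arm and its strongest challenger, while reusing the same interval construction.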
