An Optimal Algorithm for the Stochastic Bandits with Knowing Near-optimal Mean Reward

This paper studies a variation of the stochastic multi-armed bandit (MAB) problem in which the agent has prior knowledge called the Near-optimal Mean Reward (NoMR). We show that the cumulative regret of this bandit variation has a lower bound of Ω(1/Δ), where Δ is the gap between the optimal and the second-best mean reward. We propose an algorithm called NoMR-Bandit for this variation and show that its cumulative regret has a uniform upper bound of O(1/Δ). Because the upper bound matches the lower bound, NoMR-Bandit is optimal in terms of the order of the regret bounds.
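
To make the two quantities in the bounds concrete, the following is a minimal sketch of the standard definitions of the gap Δ and the cumulative (pseudo-)regret, assuming the mean reward of every arm is known to the evaluator. It is not the NoMR-Bandit algorithm, which the abstract does not specify; the function names `gap` and `cumulative_regret` and the example values are hypothetical.

```python
from typing import Sequence


def gap(means: Sequence[float]) -> float:
    """Delta: the best mean reward minus the second-best (needs >= 2 arms)."""
    top, second = sorted(means, reverse=True)[:2]
    return top - second


def cumulative_regret(means: Sequence[float], pulls: Sequence[int]) -> float:
    """Sum over rounds of (best mean - mean of the arm pulled in that round)."""
    best = max(means)
    return sum(best - means[arm] for arm in pulls)


if __name__ == "__main__":
    means = [0.9, 0.6, 0.5]   # hypothetical mean rewards, one per arm
    pulls = [1, 2, 0, 0, 0]   # hypothetical sequence of chosen arm indices
    print("gap:", gap(means))
    print("cumulative regret:", cumulative_regret(means, pulls))
```

In this sketch, regret accrues only in rounds where a suboptimal arm is pulled, so a bound of order 1/Δ that holds uniformly in the horizon means the total cost of such rounds stays bounded no matter how long the agent plays.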
