Exploring Best Arm with Top Reward-Cost Ratio in Stochastic Bandits

The best arm identification problem in the multi-armed bandit model arises in many practical applications, such as spectrum sensing, online advertising, and cloud computing. Although a large body of work has been devoted to this area, most of it does not consider the cost of pulling actions, i.e., the setting where a player must pay some cost each time she pulls an arm. Motivated by this, we study a ratio-based best arm identification problem, where each arm is associated with both a random reward and a random cost. For any δ ∈ (0,1), the player aims to find, with probability at least 1−δ, the optimal arm, namely the one with the largest ratio of expected reward to expected cost, using as few samples as possible. To solve this problem, we propose three algorithms: 1) a genie-aided algorithm, GA; 2) a successive elimination algorithm with unknown gaps, SEUG; and 3) a successive elimination algorithm with unknown gaps and variance information, SEUG-V, where the gaps denote the differences in ratio between the optimal arm and the suboptimal arms. We show that for all three algorithms, the sample complexity, i.e., the total number of arm pulls, grows logarithmically in $\frac{1}{\delta}$. Moreover, unlike existing works, the running of our elimination-type algorithms does not rely on prior knowledge of arm-related parameters, which makes them more practical. In addition, we provide a fundamental lower bound on the sample complexity of any algorithm under Bernoulli distributions, and show that the sample complexities of the three proposed algorithms match this lower bound in terms of their $\log \frac{1}{\delta}$ dependence. Finally, we validate our theoretical results through numerical experiments.
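
To make the elimination idea concrete, below is a minimal sketch of a successive-elimination loop adapted to the reward-to-cost-ratio objective. It is not the paper's exact SEUG procedure: the anytime Hoeffding-style confidence radius, the assumption that rewards and costs lie in (0, 1], the round-robin sampling of surviving arms, and the helper name seug_sketch are all illustrative choices made here for the example.

```python
import math
import random

def seug_sketch(arms, delta, max_pulls=200_000):
    """Successive elimination for the largest reward-to-cost ratio.

    `arms` is a list of callables, each returning one (reward, cost)
    sample; rewards and costs are assumed to lie in (0, 1].  The
    confidence radius below is a generic anytime Hoeffding-style
    choice, not the exact bound analyzed for SEUG in the paper.
    """
    n = len(arms)
    active = set(range(n))
    pulls = [0] * n
    reward_sum = [0.0] * n
    cost_sum = [0.0] * n
    total = 0

    def radius(i):
        # Anytime Hoeffding-style confidence radius (assumed form).
        return math.sqrt(math.log(4.0 * n * pulls[i] ** 2 / delta)
                         / (2.0 * pulls[i]))

    def ucb_ratio(i):
        # Optimistic ratio: inflate the reward mean, deflate the cost mean.
        r_hi = reward_sum[i] / pulls[i] + radius(i)
        c_lo = max(cost_sum[i] / pulls[i] - radius(i), 1e-9)
        return r_hi / c_lo

    def lcb_ratio(i):
        # Pessimistic ratio: deflate the reward mean, inflate the cost mean.
        r_lo = max(reward_sum[i] / pulls[i] - radius(i), 0.0)
        c_hi = cost_sum[i] / pulls[i] + radius(i)
        return r_lo / c_hi

    while len(active) > 1 and total < max_pulls:
        # Round-robin: pull every surviving arm once per round.
        for i in active:
            r, c = arms[i]()
            reward_sum[i] += r
            cost_sum[i] += c
            pulls[i] += 1
            total += 1
        # Discard arms whose optimistic ratio is beaten by the best
        # pessimistic ratio among the survivors.
        best_lcb = max(lcb_ratio(i) for i in active)
        active = {i for i in active if ucb_ratio(i) >= best_lcb}

    # Return the empirically best surviving arm.
    return max(active, key=lambda i: reward_sum[i] / max(cost_sum[i], 1e-9))

if __name__ == "__main__":
    random.seed(0)
    # Three hypothetical Bernoulli arms given as (mean reward, mean cost);
    # arm 0 has the best ratio, 0.8 / 0.5 = 1.6.
    params = [(0.8, 0.5), (0.6, 0.5), (0.7, 0.9)]
    arms = [lambda m=m, k=k: (float(random.random() < m),
                              float(random.random() < k))
            for m, k in params]
    print(seug_sketch(arms, delta=0.05))
```

The key property this sketch shares with elimination-type algorithms is that an arm whose optimistic ratio falls below another arm's pessimistic ratio is discarded permanently, so clearly suboptimal arms stop consuming pulls early, while no arm-dependent gap parameters need to be known in advance.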
