Budgeted Multi-Armed Bandit in Continuous Action Space

Multi-Armed Bandits (MABs) have been widely studied in the last decade to model settings in which an agent wants to learn, during the operational life of a system, which action provides the highest expected reward among a fixed set of available actions. Classical techniques provide solutions that minimize the regret due to learning in settings where selecting an arm has no cost. However, in many real-world applications the learner has to pay a cost for pulling each arm, and the learning process is constrained by a fixed budget B. This problem is addressed in the literature as the Budgeted MAB (BMAB). In this paper, for the first time, we study the Budgeted Continuous-Armed Bandit (BCAB) problem, where the set of possible actions is a continuous set (e.g., a range of prices) and, at each round, the learner receives a random reward and incurs a random cost. We provide a novel algorithm, named B-Zoom, which suffers a regret of Õ(B^{(d+1)/(d+2)}), where d is the zooming dimension of the problem. Finally, we provide an empirical analysis showing that, despite a lower average performance, the proposed approach is more robust to adverse settings than existing algorithms designed for the BMAB.
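
The paper does not spell out B-Zoom in this abstract, so the following is only a minimal illustrative sketch of the two ingredients it names: adaptive discretization of a continuous action set (a zooming-style rule) combined with a fixed budget B consumed by random pulling costs. All names, the optimistic reward-to-cost index, and the toy reward/cost model below are assumptions for illustration, not the authors' actual algorithm.

```python
"""Sketch of a budgeted zooming-style bandit on the action space [0, 1].

Illustrative only: the index and the activation rule are assumptions,
not the B-Zoom algorithm from the paper.
"""
import math
import random


class ActiveArm:
    def __init__(self, x):
        self.x = x              # position in the continuous action space [0, 1]
        self.pulls = 0
        self.reward_sum = 0.0
        self.cost_sum = 0.0

    def radius(self, t):
        # confidence radius shrinking with the number of pulls
        return math.sqrt(2.0 * math.log(t + 1) / (self.pulls + 1))


def budgeted_zooming(reward_fn, cost_fn, budget, seed=0):
    rng = random.Random(seed)
    arms = [ActiveArm(rng.random())]         # start with one arbitrary active arm
    spent, total_reward, t = 0.0, 0.0, 0

    while spent < budget:
        t += 1

        # optimistic index: upper bound on mean reward over lower bound on mean cost
        def index(a):
            mean_r = a.reward_sum / a.pulls if a.pulls else 1.0
            mean_c = a.cost_sum / a.pulls if a.pulls else 1e-3
            r = a.radius(t)
            return min(1.0, mean_r + r) / max(1e-3, mean_c - r)

        best = max(arms, key=index)

        # zooming step: if a probed point is not covered by any confidence ball,
        # activate a new arm there and pull it
        probe = rng.random()
        if all(abs(probe - a.x) > a.radius(t) for a in arms):
            arms.append(ActiveArm(probe))
            best = arms[-1]

        r, c = reward_fn(best.x, rng), cost_fn(best.x, rng)
        if spent + c > budget:
            break                             # cannot afford this pull
        best.pulls += 1
        best.reward_sum += r
        best.cost_sum += c
        spent += c
        total_reward += r

    return total_reward, spent


if __name__ == "__main__":
    # toy model: Bernoulli reward peaked at x = 0.7, cost uniform in [0.5, 1.0]
    reward = lambda x, rng: float(rng.random() < 1.0 - abs(x - 0.7))
    cost = lambda x, rng: 0.5 + 0.5 * rng.random()
    print(budgeted_zooming(reward, cost, budget=200.0))
```

Under this kind of scheme the discretization refines itself only near high-index regions, which is what ties the attainable regret to the zooming dimension d rather than to a fixed grid resolution.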
