Linear Contextual Bandits with Knapsacks

We consider the linear contextual bandit problem with resource consumption in addition to reward generation. In each round, the outcome of pulling an arm is a reward as well as a vector of resource consumptions, and the expected values of these outcomes depend linearly on the context of that arm. The budget/capacity constraints require that the total consumption of each resource not exceed its budget. The objective, as in the classic setting, is to maximize the total reward. This problem turns out to be a common generalization of classic linear contextual bandits (linContextual), bandits with knapsacks (BwK), and the online stochastic packing problem (OSPP). We present algorithms with near-optimal regret bounds for this problem. Our bounds compare favorably to those for the unstructured version of the problem, where the relation between contexts and outcomes can be arbitrary but the algorithm competes only against a fixed set of policies accessible through an optimization oracle. We combine techniques from the work on linContextual, BwK, and OSPP in a nontrivial manner, while also tackling new difficulties that are not present in any of these special cases.
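To make the setting concrete, here is a minimal formalization; the notation ($x_t(a)$, $\mu_*$, $W_*$, $B$, $m$, $T$) is our own labeling of the quantities described above, not fixed by the abstract. In round $t$, each arm $a$ has a context $x_t(a) \in \mathbb{R}^d$; pulling it yields a reward $r_t(a) \in [0,1]$ and a consumption vector $v_t(a) \in [0,1]^m$, with

$$\mathbb{E}[r_t(a) \mid x_t(a)] = \mu_*^\top x_t(a), \qquad \mathbb{E}[v_t(a) \mid x_t(a)] = W_*^\top x_t(a),$$

for an unknown vector $\mu_* \in \mathbb{R}^d$ and matrix $W_* \in \mathbb{R}^{d \times m}$. Play must satisfy $\sum_{t=1}^{T} v_t(a_t) \le B\,\mathbf{1}$ componentwise, and the goal is to maximize $\sum_{t=1}^{T} r_t(a_t)$.

The sketch below shows, in Python, one natural algorithmic template for this setting: shared ridge-regression estimates with optimistic confidence widths for the linear reward and consumption maps (the linContextual ingredient), combined with a multiplicative-weights update on a vector of dual resource prices (the BwK/OSPP ingredient). It is a schematic illustration under our own naming, parameters, and constants, not the paper's algorithm verbatim.

```python
import numpy as np

def lin_cbwk(contexts, pull, T, B, d, m, reg=1.0, alpha=1.0, eta=0.1):
    """Schematic linear contextual bandits with knapsacks (illustrative only).

    contexts: function t -> array of shape (K, d), one context row per arm.
    pull:     function (t, x) -> (reward in [0,1], length-m consumption array).
    T, B:     horizon and per-resource budget; d, m: context and resource dims.
    """
    A = reg * np.eye(d)               # shared ridge design matrix
    b_r = np.zeros(d)                 # regression target for rewards
    b_c = np.zeros((m, d))            # one regression target per resource
    theta = np.ones(m) / m            # dual prices over the m resources
    spent = np.zeros(m)
    total_reward = 0.0

    for t in range(T):
        X = contexts(t)                           # (K, d) contexts this round
        A_inv = np.linalg.inv(A)
        mu_hat = A_inv @ b_r                      # reward direction estimate
        W_hat = b_c @ A_inv                       # (m, d) consumption estimates
        width = alpha * np.sqrt(np.einsum('kd,de,ke->k', X, A_inv, X))
        # Optimistic adjusted score: upper confidence bound on reward minus the
        # priced lower confidence bound on consumption (theta >= 0 throughout).
        score = X @ mu_hat + width - (X @ W_hat.T) @ theta + width * theta.sum()
        a = int(np.argmax(score))
        x = X[a]
        r, v = pull(t, x)
        total_reward += r
        spent += v
        if np.any(spent > B):                     # stop once any budget runs out
            break
        # Update regression statistics, then reprice resources that are being
        # consumed faster than the per-round budget rate B / T.
        A += np.outer(x, x)
        b_r += r * x
        b_c += np.outer(v, x)
        theta = theta * np.exp(eta * (v / B - 1.0 / T))
        theta /= theta.sum()
    return total_reward
```

One design choice this sketch is meant to highlight: the linear structure is handled entirely by the shared regression statistics (A, b_r, b_c), while the knapsack constraints enter only through the prices theta in the arm-selection score, which is what allows the linContextual and BwK/OSPP ingredients to compose.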
