Know Your Customer: Multi-armed Bandits with Capacity Constraints

A wide range of resource allocation and platform operation settings exhibit the following two simultaneous challenges: (1) service resources are capacity constrained; and (2) clients' preferences are not perfectly known. To study this pair of challenges, we consider a service system with heterogeneous servers and clients. Server types are known, and there is a fixed capacity of servers of each type. Clients arrive over time, with types initially unknown and drawn from some distribution. Each client sequentially brings $N$ jobs before leaving. The system operator assigns each job to some server type, resulting in a payoff whose distribution depends on the client and server types. Our main contribution is a complete characterization of the structure of the optimal policy for maximizing the rate of payoff accumulation. Such a policy must balance three goals: (i) earning immediate payoffs; (ii) learning client types to increase future payoffs; and (iii) satisfying the capacity constraints. We construct a policy that has provably optimal regret (to leading order as $N$ grows large). Our policy has an appealingly simple three-phase structure: a short type-"guessing" phase, a type-"confirmation" phase that balances payoffs with learning, and finally an "exploitation" phase that focuses on payoffs. Crucially, our approach employs the shadow prices of the capacity constraints in the assignment problem with known types as "externality prices" on the servers' capacity.
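The three-phase structure described above can be illustrated with a minimal, self-contained sketch. Everything below is a toy assumption for illustration: the payoff matrix `MU`, the prices `PRICES` (which in the paper would be the shadow prices of the capacity constraints in the known-types assignment problem, not hand-picked constants), and the phase lengths. The sketch also collapses the confirmation phase into exploitation under a maximum-likelihood type estimate; the actual policy balances learning against payoffs more carefully during confirmation.

```python
import math
import random

# Toy mean payoffs mu[client_type][server_type] (Bernoulli rewards).
MU = {
    "a": [0.9, 0.2],
    "b": [0.3, 0.8],
}

# "Externality prices" on server capacity. In the paper these are the
# shadow prices (dual variables) of the assignment LP with known types;
# here they are fixed by hand purely for illustration.
PRICES = [0.1, 0.0]

def best_server(client_type, prices):
    """Pick the server maximizing the price-adjusted expected payoff."""
    scores = [MU[client_type][s] - prices[s] for s in range(len(prices))]
    return max(range(len(scores)), key=lambda s: scores[s])

def log_likelihood(obs, client_type):
    """Bernoulli log-likelihood of observed (server, payoff) pairs."""
    ll = 0.0
    for s, r in obs:
        p = MU[client_type][s]
        ll += math.log(p if r == 1 else 1.0 - p)
    return ll

def run_client(true_type, n_jobs, rng, guess_len=4):
    """Serve one client's n_jobs jobs with a three-phase policy sketch."""
    obs, total, estimate = [], 0, None
    for t in range(n_jobs):
        if t < guess_len:
            # Guessing phase: cycle through server types to identify the client.
            server = t % len(PRICES)
        else:
            # Confirmation/exploitation (collapsed here): maximum-likelihood
            # type estimate, then the price-adjusted best server for it.
            estimate = max(MU, key=lambda c: log_likelihood(obs, c))
            server = best_server(estimate, PRICES)
        payoff = 1 if rng.random() < MU[true_type][server] else 0
        obs.append((server, payoff))
        total += payoff
    return estimate, total
```

Note how the prices mediate goal (iii): a server in high demand carries a high price, so a client is routed there only if its payoff advantage exceeds the externality imposed on other clients.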
