MNL-Bandit with Knapsacks

In this paper, we study a dynamic assortment optimization problem under bandit feedback, in which a seller with a fixed initial inventory of N substitutable products faces a sequence of i.i.d. customer arrivals (drawn from an unknown distribution) over a horizon of T periods and must decide, in each period, which assortment of products to offer in order to maximize total expected revenue. Such problems arise in many applications, including online retail and recommendation. The seller initially has no (or only limited) information about customer preferences and must learn them through repeated interactions with the i.i.d. customers. Specifically, in each period the seller offers an assortment; the customer makes a choice from that assortment according to an unknown preference or choice model; and the seller observes only the resulting choice, using this bandit feedback to update its estimates and future actions. This problem therefore exemplifies the classical trade-off between exploration and exploitation: the seller must simultaneously gain information about customer preferences and offer revenue-maximizing assortments, all while respecting the inventory constraints.
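To make the interaction protocol concrete, the following is a minimal illustrative sketch (not the paper's algorithm) of one period of the setting described above: the customer chooses from the offered assortment according to a multinomial logit (MNL) model, and the seller observes only that choice. The attraction parameters `v` and all function names are hypothetical, introduced purely for illustration; in the actual problem the seller does not know `v` and must learn it from the observed choices.

```python
import random


def mnl_choice_probs(assortment, v):
    """MNL choice probabilities over an offered assortment S.

    P(i | S) = v[i] / (1 + sum_{j in S} v[j]) for each product i in S,
    and the no-purchase option (represented by None) gets 1 / (1 + sum).
    """
    denom = 1.0 + sum(v[i] for i in assortment)
    probs = {i: v[i] / denom for i in assortment}
    probs[None] = 1.0 / denom
    return probs


def simulate_period(assortment, v, rng=random):
    """One period of bandit feedback: the seller offers `assortment`
    and observes only the single chosen item (or None for no purchase)."""
    probs = mnl_choice_probs(assortment, v)
    items = list(probs)
    weights = [probs[i] for i in items]
    return rng.choices(items, weights=weights, k=1)[0]
```

A learning algorithm for this problem would repeatedly call something like `simulate_period`, re-estimate the unknown attractions from the observed choices, and re-optimize the offered assortment subject to the remaining inventory.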
