MNL-Bandit: A Dynamic Learning Approach to Assortment Selection

We consider a dynamic assortment selection problem, where in every round the retailer offers a subset (assortment) of $N$ substitutable products to a consumer, who selects one of these products according to a multinomial logit (MNL) choice model. The retailer observes this choice, and the objective is to dynamically learn the model parameters while optimizing cumulative revenues over a selling horizon of length $T$. We refer to this exploration-exploitation formulation as the MNL-Bandit problem. Existing methods for this problem follow an "explore-then-exploit" approach, which estimates the parameters to a desired accuracy and then, treating these estimates as if they were the correct parameter values, offers the optimal assortment based on them. These approaches require a priori knowledge of "separability", which is determined by the true parameters of the underlying MNL model and which in turn is critical in determining the length of the exploration period. (Separability refers to the distinguishability of the true optimal assortment from the sub-optimal alternatives.) In this paper, we give an efficient algorithm that simultaneously explores and exploits, achieving performance independent of the underlying parameters. The algorithm can be implemented in a fully online manner, without knowledge of the horizon length $T$. Furthermore, the algorithm is adaptive in the sense that its performance is near-optimal both in the "well separated" case and in the general parameter setting where this separation need not hold.
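As a rough illustration of the choice model behind each round, the sketch below computes MNL choice probabilities and the expected revenue of an offered assortment. The abstract does not spell out the functional form, so several things here are assumptions: the standard MNL form in which product $i$ has an attraction weight `v[i]`, a no-purchase option with weight `v0`, per-product revenues `r[i]`, and all function names. It shows only the environment a learner would interact with, not the algorithm proposed in the paper.

```python
import random

def mnl_choice_probabilities(assortment, v, v0=1.0):
    """Choice probabilities under a standard MNL model (assumed form).

    assortment: iterable of product ids offered this round
    v:          dict mapping product id -> attraction weight (hypothetical name)
    v0:         weight of the no-purchase option (an assumption, keyed as None)
    """
    total = v0 + sum(v[i] for i in assortment)
    probs = {i: v[i] / total for i in assortment}
    probs[None] = v0 / total  # consumer leaves without buying
    return probs

def expected_revenue(assortment, v, r, v0=1.0):
    """Expected one-round revenue of an assortment; r maps product id -> revenue."""
    probs = mnl_choice_probabilities(assortment, v, v0)
    return sum(r[i] * probs[i] for i in assortment)

def sample_choice(assortment, v, rng, v0=1.0):
    """Simulate the single choice the retailer observes in one round."""
    probs = mnl_choice_probabilities(assortment, v, v0)
    options = list(probs)
    weights = [probs[o] for o in options]
    return rng.choices(options, weights=weights, k=1)[0]

# Illustrative (made-up) parameters for three products.
v = {0: 1.0, 1: 0.5, 2: 0.5}
r = {0: 1.0, 1: 2.0, 2: 3.0}
rng = random.Random(0)
offered = [0, 1]
print(mnl_choice_probabilities(offered, v))
print(expected_revenue(offered, v, r))
print(sample_choice(offered, v, rng))
```

In a bandit setting the weights `v` would be unknown to the retailer; only the sampled choice in each round is observed, which is what makes learning and revenue optimization compete.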
