Multinomial Logit Contextual Bandits

We consider a dynamic assortment selection problem where the goal is to offer an assortment of at most K items from a set of N possible items. The sequence of assortments can be chosen as a function of the contextual information of the items, and possibly of the users, and the goal is to maximize the expected cumulative reward, or equivalently, minimize the expected regret. The distinguishing feature of our work is that the feedback, i.e., the item chosen by the user, follows a multinomial logit choice model. We propose upper confidence bound (UCB) based algorithms for this multinomial logit contextual bandit. The first algorithm is a simple and computationally more efficient method which achieves an Õ(d√T) regret over T rounds with d-dimensional feature vectors. The second algorithm, inspired by the work of Li et al. (2017), achieves an Õ(√(dT)) regret with logarithmic dependence on N, at the cost of increased computational complexity due to a pruning process.
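To make the setting concrete, the following is a minimal sketch of the two core computations in an MNL contextual bandit: the multinomial logit choice probabilities that generate user feedback, and an optimistic (UCB-style) assortment selection rule. The helper names and the specific bonus form `x^T theta_hat + alpha * sqrt(x^T V^{-1} x)` are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def mnl_choice_probs(X, theta):
    """MNL choice probabilities for an offered assortment.

    X: (K, d) feature matrix of the offered items; theta: (d,) parameter.
    A no-purchase option with utility 0 is always included.
    Returns probabilities of length K+1 (last entry = no purchase).
    """
    utilities = X @ theta
    expu = np.exp(np.append(utilities, 0.0))  # utility 0 for no-purchase
    return expu / expu.sum()

def ucb_assortment(X_all, theta_hat, V, alpha, K):
    """Offer the K items with the largest optimistic utility estimates.

    Illustrative linear-UCB style score: x^T theta_hat plus an
    exploration bonus alpha * sqrt(x^T V^{-1} x), where V is a design
    (Gram) matrix from past observations. With uniform item revenues,
    the expected-revenue-maximizing cardinality-K assortment is exactly
    the top K items by utility, so this top-K rule suffices here.
    """
    V_inv = np.linalg.inv(V)
    bonus = np.sqrt(np.einsum("id,dk,ik->i", X_all, V_inv, X_all))
    scores = X_all @ theta_hat + alpha * bonus
    return np.argsort(scores)[-K:]
```

In a full bandit loop, `theta_hat` would be a maximum-likelihood estimate updated from observed choices and `V` the accumulated Gram matrix of offered features; both are taken as given in this sketch.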

[1] Felipe Caro, et al. Dynamic Assortment with Demand Learning for Seasonal Consumer Goods, 2007, Manag. Sci.

[2] Vineet Goyal, et al. Near-Optimal Algorithms for Capacity Constrained Assortment Optimization, 2014.

[3] R. Plackett. The Analysis of Permutations, 1975.

[4] Shipra Agrawal, et al. Thompson Sampling for Contextual Bandits with Linear Payoffs, 2012, ICML.

[5] E. L. Lehmann, et al. Theory of Point Estimation, 1950.

[6] W. R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, 1933.

[7] Assaf J. Zeevi, et al. Optimal Dynamic Assortment Planning with Demand Learning, 2013, Manuf. Serv. Oper. Manag.

[8] Thomas P. Hayes, et al. Stochastic Linear Optimization under Bandit Feedback, 2008, COLT.

[9] Chandra R. Bhat, et al. Modeling the Choice Continuum: An Integrated Model of Residential Location, Auto Ownership, Bicycle Ownership, and Commute Tour Mode Choice Decisions, 2011.

[10] Lihong Li, et al. Provable Optimal Algorithms for Generalized Linear Contextual Bandits, 2017, ArXiv.

[11] Csaba Szepesvári, et al. Improved Algorithms for Linear Stochastic Bandits, 2011, NIPS.

[12] Peter Auer, et al. Using Confidence Bounds for Exploitation-Exploration Trade-offs, 2003, J. Mach. Learn. Res.

[13] David B. Shmoys, et al. Dynamic Assortment Optimization with a Multinomial Logit Choice Model and Capacity Constraint, 2010, Oper. Res.

[14] Aurélien Garivier, et al. Parametric Bandits: The Generalized Linear Case, 2010, NIPS.

[15] Huseyin Topaloglu, et al. Assortment Optimization Under Variants of the Nested Logit Model, 2014, Oper. Res.

[16] David Simchi-Levi, et al. Assortment Optimization under Unknown MultiNomial Logit Choice Models, 2017, ArXiv.

[17] Jon A. Wellner, et al. Weak Convergence and Empirical Processes: With Applications to Statistics, 1996.

[18] Vashist Avadhanula, et al. MNL-Bandit: A Dynamic Learning Approach to Assortment Selection, 2017, Oper. Res.

[19] Xi Chen, et al. A Note on Tight Lower Bound for MNL-Bandit Assortment Selection Models, 2017, ArXiv.

[20] Xi Chen, et al. Dynamic Assortment Optimization with Changing Contextual Information, 2018, J. Mach. Learn. Res.

[21] Peter Auer, et al. Finite-time Analysis of the Multiarmed Bandit Problem, 2002, Machine Learning.

[22] John N. Tsitsiklis, et al. Linearly Parameterized Bandits, 2008, Math. Oper. Res.

[23] R. Duncan Luce. Individual Choice Behavior: A Theoretical Analysis, 1979.

[24] Vashist Avadhanula, et al. Thompson Sampling for the MNL-Bandit, 2017, COLT.

[25] Wei Chu, et al. Contextual Bandits with Linear Payoff Functions, 2011, AISTATS.

[26] Garrett J. van Ryzin, et al. Revenue Management Under a General Discrete Choice Model of Consumer Behavior, 2004, Manag. Sci.

[27] Kani Chen, et al. Strong Consistency of Maximum Quasi-Likelihood Estimators in Generalized Linear Models with Fixed and Adaptive Designs, 1999.