Multinomial Logit Contextual Bandits: Provable Optimality and Practicality

We consider a sequential assortment selection problem in which user choices follow a multinomial logit (MNL) choice model with unknown parameters. In each round, the learning agent observes d-dimensional contextual information about the user and the N available items, offers an assortment of size K to the user, and observes bandit feedback in the form of the item chosen from the assortment. We propose upper-confidence-bound (UCB) based algorithms for this MNL contextual bandit. The first algorithm is a simple and practical method that achieves an Õ(d√T) regret over T rounds. We then propose a second algorithm that achieves an Õ(√(dT)) regret. This matches the lower bound for the MNL bandit problem up to logarithmic terms and improves on the best known result by a √d factor. To establish this sharper regret bound, we develop a non-asymptotic confidence bound for the maximum likelihood estimator of the MNL model, which may be of independent interest as a theoretical contribution in its own right. Finally, we revisit the simpler and significantly more practical first algorithm and show that a simple variant of it achieves the optimal regret for a broad class of important applications.
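For concreteness, MNL contextual bandits of this kind are usually parametrized with linear utilities; the sketch below shows that standard form (the notation x_{ti}, θ*, S_t, and y_t is introduced here for illustration, and the exact parametrization used in the paper may differ).

```latex
% Standard MNL choice model with linear utilities (assumed parametrization).
% x_{ti} \in \mathbb{R}^d is the context of item i in round t, \theta^* \in \mathbb{R}^d
% the unknown parameter, S_t the offered assortment, and y_t the chosen item;
% the outside "no purchase" option has utility normalized to 0.
\[
  \mathbb{P}\bigl(y_t = i \mid S_t\bigr)
  \;=\; \frac{\exp\!\bigl(x_{ti}^\top \theta^*\bigr)}
             {1 + \sum_{j \in S_t} \exp\!\bigl(x_{tj}^\top \theta^*\bigr)},
  \qquad i \in S_t .
\]
```

Roughly speaking, the UCB-based algorithms described in the abstract replace θ* in this expression with an estimate inflated by an optimism bonus when selecting the assortment each round.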
