Improved Optimistic Algorithm for the Multinomial Logit Contextual Bandit

We consider a dynamic assortment selection problem in which the goal is to offer a sequence of assortments of cardinality at most $K$, out of $N$ items, so as to minimize the expected cumulative regret (loss of revenue). Feedback is given by a multinomial logit (MNL) choice model. We study this sequential decision-making problem under the MNL contextual bandit framework. Existing algorithms for the MNL contextual bandit achieve frequentist regret guarantees of $\tilde{\mathrm{O}}(\kappa\sqrt{T})$, where $\kappa$ is an instance-dependent constant. Since $\kappa$ can be arbitrarily large, e.g., exponentially dependent on the model parameters, these guarantees can be substantially loose. We propose an optimistic algorithm with a carefully designed exploration bonus term and show that it enjoys $\tilde{\mathrm{O}}(\sqrt{T})$ regret: the $\kappa$ factor affects only the poly-logarithmic term, not the leading term of the bound.
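To make the setting concrete, below is a minimal Python sketch of the MNL choice model and a generic optimistic (UCB-style) assortment-selection step. This is not the paper's algorithm: all function names are ours, and the elliptical bonus $\alpha\,\|x_i\|_{V^{-1}}$ is a standard stand-in assumed here for illustration, not the carefully designed exploration bonus the paper analyzes.

```python
import numpy as np
from itertools import combinations

def mnl_probs(utilities):
    """Choice probabilities under an MNL model with an outside option of utility 0."""
    u = np.concatenate(([0.0], utilities))  # index 0 = "no purchase" option
    z = np.exp(u - u.max())                 # stabilized softmax over S ∪ {outside option}
    p = z / z.sum()
    return p[1:]                            # drop the outside option's probability

def expected_revenue(utilities, revenues):
    """Expected revenue of an assortment given per-item utilities and revenues."""
    return float(revenues @ mnl_probs(utilities))

def optimistic_assortment(theta_hat, V_inv, X, r, K, alpha):
    """Brute-force the assortment of size <= K maximizing optimistic expected revenue.

    Optimism enters via per-item utility bonuses alpha * ||x_i||_{V^{-1}};
    the bonus shape here is a generic UCB assumption, not the paper's exact term.
    """
    bonuses = alpha * np.sqrt(np.einsum('id,de,ie->i', X, V_inv, X))
    u_opt = X @ theta_hat + bonuses         # optimistic utility for each item
    best_S, best_rev = (), 0.0              # the empty assortment earns zero revenue
    for k in range(1, K + 1):
        for S in combinations(range(len(X)), k):
            rev = expected_revenue(u_opt[list(S)], r[list(S)])
            if rev > best_rev:
                best_S, best_rev = S, rev
    return best_S, best_rev

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, N, K = 5, 8, 3
    X = rng.normal(size=(N, d)) / np.sqrt(d)    # item contexts
    theta_hat = rng.normal(size=d)              # current parameter estimate
    V_inv = np.linalg.inv(np.eye(d) + X.T @ X)  # inverse of a regularized design matrix
    r = rng.uniform(0.5, 1.0, size=N)           # item revenues
    S, rev = optimistic_assortment(theta_hat, V_inv, X, r, K, alpha=1.0)
    print(f"offer items {S} with optimistic expected revenue {rev:.3f}")
```

The brute-force enumeration over assortments of size at most $K$ is only viable for small $N$; it is used here purely to keep the sketch self-contained, since the assortment optimization subroutine is orthogonal to the regret analysis.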
