Improved Optimistic Algorithm For The Multinomial Logit Contextual Bandit

We consider a dynamic assortment selection problem where the goal is to offer a sequence of assortments of cardinality at most $K$, out of $N$ items, so as to minimize the expected cumulative regret (loss of revenue). The feedback is given by a multinomial logit (MNL) choice model. This sequential decision-making problem is studied under the MNL contextual bandit framework. The existing algorithms for MNL contextual bandits have frequentist regret guarantees of $\tilde{\mathrm{O}}(\kappa\sqrt{T})$, where $\kappa$ is an instance-dependent constant. Because $\kappa$ can be arbitrarily large (for example, exponentially dependent on the model parameters), the existing regret guarantees can be substantially loose. We propose an optimistic algorithm with a carefully designed exploration bonus term and show that it enjoys $\tilde{\mathrm{O}}(\sqrt{T})$ regret. In our bounds, the $\kappa$ factor affects only the poly-log term and not the leading term of the regret.
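To make the problem setup concrete, the following is a minimal sketch of MNL choice probabilities, the expected revenue of an assortment, and the per-round regret. It is not the paper's algorithm. It assumes a linear utility (the inner product of an item's feature vector with a parameter `theta`), an outside "no purchase" option with utility 0, and per-item revenues; the function names and the brute-force search over assortments are hypothetical choices made for illustration only.

```python
from itertools import combinations
import math


def mnl_choice_probs(utilities):
    """Return (p_no_purchase, [p_item, ...]) for the offered items.

    Assumption: the outside (no-purchase) option has utility 0.
    """
    # Shift by the largest utility for numerical stability.
    m = max(0.0, max(utilities, default=0.0))
    weights = [math.exp(u - m) for u in utilities]
    outside = math.exp(-m)
    denom = outside + sum(weights)
    return outside / denom, [w / denom for w in weights]


def expected_revenue(assortment, features, theta, revenues):
    """Expected revenue of offering `assortment` (a tuple of item indices)."""
    # Assumption: utility of item i is the inner product <features[i], theta>.
    utils = [sum(f * t for f, t in zip(features[i], theta)) for i in assortment]
    _, probs = mnl_choice_probs(utils)
    return sum(revenues[i] * p for i, p in zip(assortment, probs))


def best_assortment(features, theta, revenues, K):
    """Brute-force search over all assortments of size at most K."""
    best, best_rev = (), 0.0
    for size in range(1, K + 1):
        for S in combinations(range(len(features)), size):
            rev = expected_revenue(S, features, theta, revenues)
            if rev > best_rev:
                best, best_rev = S, rev
    return best, best_rev


def round_regret(offered, features, theta, revenues, K):
    """Revenue lost in one round by offering `offered` instead of the best set."""
    _, opt = best_assortment(features, theta, revenues, K)
    return opt - expected_revenue(offered, features, theta, revenues)
```

The cumulative regret in the abstract is the sum of such per-round gaps over $T$ rounds, evaluated at the true (unknown) parameter. The brute-force search is only practical for small $N$ and $K$.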

Vashist Avadhanula | Theja Tulabandhula | Priyank Agrawal

[1] Marc Abeille, et al. Improved Optimistic Algorithms for Logistic Bandits, 2020, ICML.

[2] John N. Tsitsiklis, et al. Linearly Parameterized Bandits, 2008, Math. Oper. Res.

[3] Felipe Caro, et al. Dynamic Assortment with Demand Learning for Seasonal Consumer Goods, 2007, Manag. Sci.

[4] Wei Chu, et al. Contextual Bandits with Linear Payoff Functions, 2011, AISTATS.

[5] Min-hwan Oh, et al. Thompson Sampling for Multinomial Logit Contextual Bandits, 2019, NeurIPS.

[6] Alessandro Lazaric, et al. Linear Thompson Sampling Revisited, 2016, AISTATS.

[7] Lacra Pavel, et al. On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning, 2017, ArXiv.

[8] Tor Lattimore, et al. Adaptive Exploration in Linear Contextual Bandit, 2020, AISTATS.

[9] Aurélien Garivier, et al. Parametric Bandits: The Generalized Linear Case, 2010, NIPS.

[10] Sébastien Bubeck, et al. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, 2012, Found. Trends Mach. Learn.

[11] E. L. Lehmann, et al. Theory of point estimation, 1950.

[12] Vashist Avadhanula, et al. MNL-Bandit: A Dynamic Learning Approach to Assortment Selection, 2017, Oper. Res.

[13] David B. Shmoys, et al. Dynamic Assortment Optimization with a Multinomial Logit Choice Model and Capacity Constraint, 2010, Oper. Res.

[14] Xi Chen, et al. Dynamic Assortment Optimization with Changing Contextual Information, 2018, J. Mach. Learn. Res.

[15] Francis R. Bach, et al. Self-concordant analysis for logistic regression, 2009, ArXiv.

[16] Peter Auer, et al. Using Confidence Bounds for Exploitation-Exploration Trade-offs, 2003, J. Mach. Learn. Res.

[17] Lihong Li, et al. Provable Optimal Algorithms for Generalized Linear Contextual Bandits, 2017, ArXiv.

[18] Min-hwan Oh, et al. Multinomial Logit Contextual Bandits, 2019.

[19] Assaf J. Zeevi, et al. Optimal Dynamic Assortment Planning with Demand Learning, 2013, Manuf. Serv. Oper. Manag.

[20] Thomas P. Hayes, et al. Stochastic Linear Optimization under Bandit Feedback, 2008, COLT.

[21] Yuchen Zhang, et al. DiSCO: Distributed Optimization for Self-Concordant Empirical Loss, 2015, ICML.

[22] Julien Mairal, et al. Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization, 2013, NIPS.

[23] Vashist Avadhanula, et al. Thompson Sampling for the MNL-Bandit, 2017, COLT.

[24] Csaba Szepesvári, et al. Improved Algorithms for Linear Stochastic Bandits, 2011, NIPS.