Thompson Sampling for the MNL-Bandit

We consider a sequential subset selection problem under parameter uncertainty, where at each time step the decision maker selects a subset of cardinality $K$ from $N$ possible items (arms) and observes bandit feedback in the form of the index of one of the items in that subset, or none. Each item in the index set is ascribed a certain value (reward), and the feedback is governed by a Multinomial Logit (MNL) choice model whose parameters are a priori unknown. The objective of the decision maker is to maximize the expected cumulative reward over a finite horizon $T$ or, alternatively, to minimize the regret relative to an oracle that knows the MNL parameters. We refer to this as the MNL-Bandit problem. This problem is representative of a larger family of exploration-exploitation problems that involve a combinatorial objective and arise in several important application domains. We present an approach for adapting Thompson Sampling to this problem and show that it achieves near-optimal regret as well as attractive numerical performance.
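To make the feedback structure concrete, recall the standard MNL choice probabilities, with the no-choice option indexed by $0$ and its weight normalized to $v_0 = 1$. When an assortment $S \subseteq \{1,\dots,N\}$ is offered, the probability of observing item $i$, and of observing no selection, are

$$ p_i(S) = \frac{v_i}{1 + \sum_{j \in S} v_j} \quad \text{for } i \in S, \qquad p_0(S) = \frac{1}{1 + \sum_{j \in S} v_j}, $$

and an oracle that knows $\mathbf{v} = (v_1,\dots,v_N)$ would offer the assortment maximizing the expected reward $R(S,\mathbf{v}) = \sum_{i \in S} r_i \, p_i(S)$.

The Python sketch below illustrates the flavor of Thompson Sampling in this setting; it is a minimal sketch under simplifying assumptions, not the paper's exact algorithm. It assumes a crude Gamma proxy for the posterior over each $v_i$ (chosen only because its mean matches the empirical estimate picks/epochs; the paper constructs its posteriors more carefully), brute-force assortment optimization (viable only for small $N$), and an epoch-based offering rule in which the same assortment is offered until a no-purchase outcome occurs, so that per-epoch pick counts of item $i$ have mean $v_i$. All instance values are illustrative.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

# Hypothetical problem instance (values are illustrative assumptions).
N, K, T = 10, 4, 2000                     # items, assortment capacity, horizon
true_v = rng.uniform(0.1, 1.0, size=N)    # unknown MNL weights; no-purchase weight v_0 = 1
rewards = rng.uniform(0.5, 1.0, size=N)   # per-item rewards r_i

# Sufficient statistics: epochs in which item i was offered, and total picks of i.
# With epoch-based offering, picks-per-epoch of item i has mean v_i,
# so n_picks / n_epochs estimates v_i.
n_epochs = np.ones(N)
n_picks = np.ones(N)

def expected_reward(S, v):
    """Expected MNL reward of assortment S under weights v."""
    S = list(S)
    denom = 1.0 + v[S].sum()
    return (rewards[S] * v[S]).sum() / denom

def best_assortment(v):
    """Brute-force argmax over assortments of size <= K (small N only)."""
    candidates = (S for k in range(1, K + 1) for S in combinations(range(N), k))
    return max(candidates, key=lambda S: expected_reward(S, v))

t = 0
while t < T:
    # Thompson step: sample plausible weights from a Gamma posterior proxy
    # whose mean is the empirical estimate n_picks / n_epochs.
    v_sample = rng.gamma(shape=n_picks, scale=1.0 / n_epochs)
    S = best_assortment(v_sample)
    idx = list(S)

    # Offer S repeatedly until the no-purchase outcome ends the epoch.
    picks = np.zeros(N)
    while t < T:
        t += 1
        denom = 1.0 + true_v[idx].sum()
        probs = np.append(true_v[idx] / denom, 1.0 / denom)  # items of S, then no-purchase
        choice = rng.choice(len(idx) + 1, p=probs)
        if choice == len(idx):            # no-purchase observed: epoch ends
            break
        picks[idx[choice]] += 1

    n_epochs[idx] += 1
    n_picks[idx] += picks[idx]
```

The brute-force search is exponential in $K$; in practice the sampling step would be paired with one of the known efficient algorithms for static assortment optimization under the MNL model.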
