Multi-armed Bandits with Cost Subsidy
[1] Samuel Daulton, et al. Thompson Sampling for Contextual Bandit Problems with Auxiliary Safety Constraints, 2019, arXiv.
[2] W. R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, 1933, Biometrika.
[3] Kirthevasan Kandasamy, et al. A Flexible Framework for Multi-Objective Bayesian Optimization using Random Scalarizations, 2018, UAI.
[4] Mark Huber. Nearly Optimal Bernoulli Factories for Linear Functions, 2016, Comb. Probab. Comput.
[5] Michèle Sebag, et al. Exploration vs Exploitation vs Safety: Risk-Aware Multi-Armed Bandits, 2013, ACML.
[6] Archie C. Chapman, et al. Knapsack Based Optimal Policies for Budget-Limited Multi-Armed Bandits, 2012, AAAI.
[7] Clayton Scott, et al. Top Feasible Arm Identification, 2019, AISTATS.
[8] Milton Abramowitz, et al. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 1964.
[9] Vashist Avadhanula, et al. A Near-Optimal Exploration-Exploitation Approach for Assortment Selection, 2016, EC.
[10] Christos Thrampoulidis, et al. Generalized Linear Bandits with Safety Constraints, 2020, IEEE ICASSP.
[11] Claudio Gentile, et al. Regret Minimization for Reserve Prices in Second-Price Auctions, 2015, IEEE Transactions on Information Theory.
[12] Steven L. Scott, et al. A modern Bayesian look at the multi-armed bandit, 2010.
[13] Nikhil R. Devanur, et al. Bandits with concave rewards and convex knapsacks, 2014, EC.
[14] Aleksandrs Slivkins, et al. Online decision making in crowdsourcing markets: theoretical challenges, 2013, SIGecom Exchanges.
[15] Ann Nowé, et al. Designing multi-objective multi-armed bandits algorithms: A study, 2013, IJCNN.
[16] Bernard Manderick, et al. Annealing-Pareto multi-objective multi-armed bandit algorithm, 2014, IEEE ADPRL.
[17] Wei Chen, et al. Combinatorial Pure Exploration of Multi-Armed Bandits, 2014, NIPS.
[18] Wei Cao, et al. On Top-k Selection in Multi-Armed Bandits and Hidden Bipartite Graphs, 2015, NIPS.
[19] Yifan Wu, et al. Conservative Bandits, 2016, ICML.
[20] Ananthram Swami, et al. Distributed Algorithms for Learning and Cognitive Medium Access with Logarithmic Regret, 2010, IEEE Journal on Selected Areas in Communications.
[21] Sanjay Shakkottai, et al. Social Learning in Multi Agent Multi Armed Bandits, 2019, Proc. ACM Meas. Anal. Comput. Syst.
[22] Shipra Agrawal, et al. Near-Optimal Regret Bounds for Thompson Sampling, 2017, J. ACM.
[23] Songwu Lu, et al. Analysis of the Reliability of a Nationwide Short Message Service, 2007, IEEE INFOCOM.
[24] Peter Auer, et al. The Nonstochastic Multiarmed Bandit Problem, 2002, SIAM J. Comput.
[25] Songwu Lu, et al. A study of the short message service of a nationwide cellular network, 2006, IMC.
[26] John Langford, et al. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information, 2007, NIPS.
[27] Vashist Avadhanula, et al. Thompson Sampling for the MNL-Bandit, 2017, COLT.
[28] Robert D. Nowak, et al. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting, 2014, CISS.
[29] Wei Chu, et al. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms, 2011, WSDM.
[30] Benjamin Van Roy, et al. Conservative Contextual Linear Bandits, 2016, NIPS.
[31] Aleksandrs Slivkins, et al. Introduction to Multi-Armed Bandits, 2019, Found. Trends Mach. Learn.
[32] Aleksandrs Slivkins, et al. Bandits with Knapsacks, 2013, FOCS.
[33] Jian Li, et al. Pure Exploration of Multi-armed Bandit Under Matroid Constraints, 2016, COLT.
[34] Lihong Li, et al. Counterfactual Estimation and Optimization of Click Metrics in Search Engines: A Case Study, 2015, WWW.
[35] Nicole Immorlica, et al. Adversarial Bandits with Knapsacks, 2019, FOCS.
[36] George L. O'Brien, et al. A Bernoulli factory, 1994, ACM Trans. Model. Comput. Simul.
[37] Sébastien Bubeck, et al. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, 2012, Found. Trends Mach. Learn.
[38] M. Abramowitz, et al. Handbook of Mathematical Functions, with Formulas, Graphs, and Mathematical Tables, 1966.
[39] Marnelli Canlas, et al. A quantitative analysis of the Quality of Service of Short Message Service in the Philippines, 2010, IEEE International Conference on Communication Systems.
[40] H. Robbins. Some aspects of the sequential design of experiments, 1952.
[41] Bernard Manderick, et al. Thompson Sampling for Multi-Objective Multi-Armed Bandits Problem, 2015, ESANN.
[42] Osunade Oluwaseyitanfunmi, et al. Route Optimization for Delivery of Short Message Service in Telecommunication Networks, 2015.
[43] Lihong Li, et al. An Empirical Evaluation of Thompson Sampling, 2011, NIPS.
[44] David S. Leslie, et al. Optimistic Bayesian Sampling in Contextual-Bandit Problems, 2012, J. Mach. Learn. Res.