Multi-armed Bandits with Cost Subsidy

In this paper, we consider a novel variant of the multi-armed bandit (MAB) problem, MAB with cost subsidy, which models many real-life applications where the learning agent has to pay to select an arm and is concerned with optimizing both cumulative costs and rewards. We present two applications, the intelligent SMS routing problem and the ad audience optimization problem faced by many businesses (especially online platforms), and show how our problem uniquely captures key features of these applications. We show that naive generalizations of existing MAB algorithms, such as Upper Confidence Bound and Thompson Sampling, do not perform well for this problem. We then establish a fundamental lower bound of $\Omega(K^{1/3} T^{2/3})$ on the performance of any online learning algorithm for this problem, where $T$ is the time horizon and $K$ is the number of arms, highlighting its hardness relative to the classical MAB problem. We also present a simple variant of the explore-then-commit algorithm and establish near-optimal regret bounds for it. Lastly, we perform extensive numerical simulations to understand the behavior of a suite of algorithms on various instances and offer a practical guide for choosing among them.
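To make the explore-then-commit variant concrete, below is a minimal sketch, not the paper's exact algorithm: it assumes rewards in $[0, 1]$, known per-arm costs (in general, costs may also need to be learned from bandit feedback), and a subsidy factor alpha under which any arm with mean reward at least $(1 - \alpha)$ times the best mean reward is acceptable; the function name etc_cost_subsidy, the pull interface, and the exploration schedule are illustrative choices.

```python
import numpy as np

def etc_cost_subsidy(pull, K, T, costs, alpha):
    """Explore-then-commit sketch for MAB with a cost subsidy.

    pull(i): returns a stochastic reward in [0, 1] for arm i.
    costs:   known per-arm costs (a simplifying assumption).
    alpha:   subsidy factor; any arm whose mean reward is at
             least (1 - alpha) times the best mean reward is
             acceptable, and we want the cheapest such arm.
    """
    # Uniform exploration: pull each arm n times, with
    # n ~ (T / K)^(2/3) so the total exploration budget is of
    # order K^(1/3) T^(2/3), matching the lower bound's scaling.
    n = max(1, int((T / K) ** (2 / 3)))
    means = np.array([np.mean([pull(i) for _ in range(n)])
                      for i in range(K)])

    # Commit: the cheapest arm whose estimated reward clears
    # the subsidized threshold (1 - alpha) * max_i means[i].
    threshold = (1 - alpha) * means.max()
    feasible = [i for i in range(K) if means[i] >= threshold]
    best = min(feasible, key=lambda i: costs[i])

    # Spend the remaining horizon on the committed arm.
    total_reward = sum(pull(best) for _ in range(max(0, T - n * K)))
    return best, total_reward

# Hypothetical usage with Bernoulli arms.
rng = np.random.default_rng(0)
mu = [0.9, 0.85, 0.5]   # mean rewards
c = [1.0, 0.2, 0.1]     # per-pull costs
arm, _ = etc_cost_subsidy(lambda i: float(rng.random() < mu[i]),
                          K=3, T=10_000, costs=c, alpha=0.1)
print(arm)  # expect arm 1: nearly as rewarding as arm 0, far cheaper
```

The per-arm exploration budget of order $(T/K)^{2/3}$ is what drives the $T^{2/3}$ dependence of the regret, consistent with the $\Omega(K^{1/3} T^{2/3})$ lower bound above and in contrast to the $\sqrt{T}$ regret achievable in the classical MAB problem.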
