Cost-Aware Cascading Bandits

In this paper, we propose a cost-aware cascading bandits model, a new variant of multi-armed bandits with cascading feedback that accounts for the random cost of pulling arms. In each step, the learning agent chooses an ordered list of items and examines them sequentially until a certain stopping condition is satisfied. Our objective is to maximize the expected net reward in each step, i.e., the reward obtained in that step minus the total cost incurred in examining the items, by choosing the ordered list of items as well as when to stop the examination. We first consider the setting where the instantaneous cost of pulling an arm is unknown to the learner until the arm has been pulled. We study both the offline and online settings, depending on whether the state and cost statistics of the items are known beforehand. For the offline setting, we show that the Unit Cost Ranking with Threshold 1 (UCR-T1) policy is optimal. For the online setting, we propose a Cost-aware Cascading Upper Confidence Bound (CC-UCB) algorithm and show that its cumulative regret scales as $O(\log T)$. We also provide a lower bound for all $\alpha$-consistent policies, which scales as $\Omega(\log T)$ and matches our upper bound. We then investigate the setting where the instantaneous cost of pulling each arm is available to the learner for its decision-making, and show that a slight modification of the CC-UCB algorithm, termed CC-UCB2, is order-optimal. The performance of the algorithms is evaluated with both synthetic and real-world data.
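
To make the setting concrete, below is a minimal sketch of a UCB-style learner for a cost-aware cascading problem of this flavor. It is not the paper's CC-UCB algorithm verbatim: the confidence radius, the inclusion rule (examine an arm only if its optimistic reward estimate covers its pessimistic cost estimate), and the ordering by optimistic reward are illustrative assumptions, and the class name is hypothetical.

```python
import math

class CostAwareCascadeUCB:
    """Illustrative UCB-style index for a cost-aware cascading bandit."""

    def __init__(self, n_arms):
        self.n = n_arms
        self.pulls = [0] * n_arms          # times each arm has been examined
        self.reward_sum = [0.0] * n_arms   # observed binary states (1 = success)
        self.cost_sum = [0.0] * n_arms     # observed examination costs
        self.t = 0                         # global step counter

    def _radius(self, i):
        # Standard UCB1-style confidence radius (an assumption here).
        if self.pulls[i] == 0:
            return float("inf")
        return math.sqrt(1.5 * math.log(self.t + 1) / self.pulls[i])

    def select_list(self):
        """Return an ordered list of arms to examine in this step."""
        self.t += 1
        candidates = []
        for i in range(self.n):
            if self.pulls[i] == 0:
                candidates.append((float("inf"), i))
                continue
            ucb_reward = self.reward_sum[i] / self.pulls[i] + self._radius(i)
            lcb_cost = max(0.0, self.cost_sum[i] / self.pulls[i] - self._radius(i))
            # Illustrative inclusion rule: keep an arm only if its optimistic
            # reward estimate can cover its pessimistic cost estimate.
            if ucb_reward >= lcb_cost:
                candidates.append((ucb_reward, i))
        candidates.sort(reverse=True)      # examine higher-index arms first
        return [i for _, i in candidates]

    def update(self, arm, state, cost):
        """Record the observed binary state and cost of one examined arm."""
        self.pulls[arm] += 1
        self.reward_sum[arm] += state
        self.cost_sum[arm] += cost
```

In a simulation loop, one would examine the returned arms in order, call `update` for each examined arm, and stop early once the chosen stopping condition is met (e.g., as soon as an examined arm returns state 1, one natural cascading stopping rule); the net reward of the step is the reward collected minus the sum of the incurred costs.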
