Cost-Aware Cascading Bandits

In this paper, we propose a cost-aware cascading bandits model, a new variant of multi-armed bandits with cascading feedback that accounts for the random cost of pulling arms. In each step, the learning agent chooses an ordered list of items and examines them sequentially until a certain stopping condition is satisfied. Our objective is to maximize the expected net reward in each step, i.e., the reward obtained in that step minus the total cost incurred in examining the items, by deciding both the ordered list of items and when to stop the examination. We study both the offline and online settings, which differ in whether the state and cost statistics of the items are known beforehand. For the offline setting, we show that the Unit Cost Ranking with Threshold 1 (UCR-T1) policy is optimal. For the online setting, we propose a Cost-aware Cascading Upper Confidence Bound (CC-UCB) algorithm and show that its cumulative regret scales as O(log T). We also provide a lower bound for all α-consistent policies, which scales as Ω(log T) and matches our upper bound. The performance of the CC-UCB algorithm is evaluated with both synthetic and real-world data.
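The offline policy can be illustrated with a minimal sketch. The abstract does not spell out the state model or the ranking rule, so the following is one reading of the name "Unit Cost Ranking with Threshold 1" under stated assumptions: each item i has an independent Bernoulli state with known mean theta[i] and a known mean cost cost[i], examination stops at the first item whose state is 1, and the reward is 1 if such an item is found. The function names ucr_t1 and expected_net_reward are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def ucr_t1(theta: Sequence[float], cost: Sequence[float]) -> List[int]:
    """Rank items by theta/cost (descending) and keep only ratios above 1.

    Assumption: this is an interpretation of "Unit Cost Ranking with
    Threshold 1"; the abstract itself does not define the rule.
    """
    ratios = [(t / c, i) for i, (t, c) in enumerate(zip(theta, cost))]
    kept = [(r, i) for r, i in ratios if r > 1.0]
    # Sort by ratio descending; break ties by item index for determinism.
    kept.sort(key=lambda p: (-p[0], p[1]))
    return [i for _, i in kept]


def expected_net_reward(
    order: Sequence[int], theta: Sequence[float], cost: Sequence[float]
) -> float:
    """Expected reward minus expected cost for a given examination order.

    Assumption: states are independent Bernoulli variables and the agent
    stops at the first item whose state is 1 (reward 1), or when the list
    is exhausted (reward 0).
    """
    total = 0.0
    reach = 1.0  # probability that the current item is examined at all
    for i in order:
        # If reached, the item costs cost[i] and yields reward 1 w.p. theta[i].
        total += reach * (theta[i] - cost[i])
        reach *= 1.0 - theta[i]
    return total


if __name__ == "__main__":
    theta = [0.9, 0.5, 0.2]
    cost = [0.3, 0.6, 0.1]
    order = ucr_t1(theta, cost)
    print(order, expected_net_reward(order, theta, cost))
```

In this toy instance, item 1 has theta/cost below 1, so examining it would lower the expected net reward; that is the intuition behind a threshold of 1 under these assumptions.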
