Gambler Bandits and the Regret of Being Ruined

In this paper we consider a particular class of problems, called multiarmed gambler bandits (MAGB), which constitutes a modified version of the Bernoulli MAB problem where two new elements must be taken into account: the budget and the risk of ruin. The agent has an initial budget that evolves over time according to the received rewards, which can be either +1 after a success or −1 after a failure. The problem can also be seen as a MAB version of the classic gambler's ruin game. The contribution of this paper is twofold: a preliminary analysis of the probability of being ruined given the current budget and past observations, and the proposal of an alternative regret formulation that combines the classic notion of regret with the expected loss due to the probability of ruin. Finally, standard state-of-the-art methods are experimentally compared using the proposed metric.
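To make the setting concrete, the following is a minimal sketch of a MAGB environment as described above: Bernoulli arms paying +1 on success and −1 on failure, an initial budget, and termination by ruin when the budget reaches zero. The class name, arm probabilities, and budget value are illustrative assumptions, not the authors' implementation.

```python
import random


class GamblerBandit:
    """Minimal sketch of a multiarmed gambler bandit (MAGB) environment.

    Each arm is a Bernoulli arm with an unknown success probability.
    The agent starts with an initial budget; every pull yields +1 on a
    success and -1 on a failure, and the episode ends (ruin) when the
    budget reaches 0. Names and parameters here are hypothetical.
    """

    def __init__(self, success_probs, initial_budget):
        self.success_probs = success_probs  # unknown to the learner
        self.budget = initial_budget
        self.ruined = False

    def pull(self, arm):
        """Pull an arm, return the +1/-1 reward, and update the budget."""
        if self.ruined:
            raise RuntimeError("the gambler is already ruined")
        reward = 1 if random.random() < self.success_probs[arm] else -1
        self.budget += reward
        self.ruined = self.budget <= 0
        return reward


# Example: a two-armed instance played uniformly at random until ruin
# or until a fixed horizon is reached.
if __name__ == "__main__":
    env = GamblerBandit(success_probs=[0.45, 0.55], initial_budget=10)
    t = 0
    while not env.ruined and t < 1000:
        env.pull(random.randrange(2))
        t += 1
    print(f"stopped at t={t}, budget={env.budget}, ruined={env.ruined}")
```

Under this setup, a policy's performance depends not only on how often it plays the best arm (classic regret) but also on whether it survives long enough to keep playing, which is what the proposed ruin-aware regret is meant to capture.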
