Optimal Thompson Sampling strategies for support-aware CVaR bandits

In this paper we study a multi-armed bandit problem in which the quality of each arm is measured by the Conditional Value at Risk (CVaR) at some level α of the reward distribution. While existing works in this setting mainly focus on Upper Confidence Bound algorithms, we introduce a new Thompson Sampling approach for CVaR bandits on bounded rewards that is flexible enough to solve a variety of problems grounded in physical resources. Building on recent work by Riou and Honda (2020), we introduce B-CVTS for continuous bounded rewards and M-CVTS for multinomial distributions. On the theoretical side, we provide a non-trivial extension of their analysis that makes it possible to bound the CVaR regret of these two strategies. Strikingly, our results show that these strategies are the first to provably achieve asymptotic optimality in CVaR bandits, matching the corresponding asymptotic lower bounds for this setting. Further, we illustrate empirically the benefit of Thompson Sampling approaches both in a realistic environment simulating a use-case in agriculture and on various synthetic examples.
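To make the objective concrete, the following is a minimal sketch of a CVaR computation and of one Thompson-Sampling-style arm selection for multinomial arms. It is an illustration under stated assumptions, not the paper's reference implementation: the function names `cvar_discrete` and `choose_arm` are hypothetical, rewards are treated as "higher is better" so that CVaR at level α is the mean of the worst α-fraction of outcomes, and a uniform Dirichlet prior over the known support is assumed.

```python
import numpy as np


def cvar_discrete(support, probs, alpha):
    """CVaR at level alpha of a discrete reward distribution.

    Uses CVaR_alpha = max_x { x - E[(x - X)^+] / alpha }, with the maximum
    taken over the support points. Rewards are "higher is better", so this
    equals the mean of the worst alpha-fraction of outcomes.
    """
    support = np.asarray(support, dtype=float)
    probs = np.asarray(probs, dtype=float)
    # shortfall[j] = E[(x_j - X)^+] for each candidate threshold x_j
    shortfall = np.maximum(support[:, None] - support[None, :], 0.0) @ probs
    return float(np.max(support - shortfall / alpha))


def choose_arm(support, counts, alpha, rng):
    """One illustrative Thompson-Sampling-style step for multinomial arms.

    counts[k][i] is the number of times arm k returned support[i].
    Assumption: uniform Dirichlet prior, so the posterior of arm k is
    Dirichlet(1 + counts[k]). Each arm is scored by the CVaR of one
    sampled probability vector, and the best-scoring arm is returned.
    """
    scores = []
    for c in counts:
        sampled_probs = rng.dirichlet(1.0 + np.asarray(c, dtype=float))
        scores.append(cvar_discrete(support, sampled_probs, alpha))
    return int(np.argmax(scores))
```

With α = 1 the function reduces to the mean of the distribution, while small α focuses on the lower tail, which is why two arms with equal means can be ranked differently under CVaR.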

[1]  M. Tollenaar,et al.  Yield potential, yield stability and stress tolerance in maize , 2002 .

[2]  D. Tasche,et al.  On the coherence of expected shortfall , 2001, cond-mat/0104295.

[3]  Senthold Asseng,et al.  The DSSAT crop modeling ecosystem , 2019 .

[4]  R. L. McCown,et al.  Changing systems for supporting farmers' decisions: problems, paradigms, and prospects , 2002 .

[5]  L. T. Evans,et al.  Yield potential: its definition, measurement, and significance , 1999 .

[6]  Philip S. Thomas,et al.  Concentration Inequalities for Conditional Value at Risk , 2019, ICML.

[7]  Akimichi Takemura,et al.  An Asymptotically Optimal Bandit Algorithm for Bounded Support Models , 2010, COLT.

[8]  Eyke Hüllermeier,et al.  Qualitative Multi-Armed Bandits: A Quantile-Based Approach , 2015, ICML.

[9]  Philippe Artzner,et al.  Coherent Measures of Risk , 1999 .

[10]  Wouter M. Koolen,et al.  Optimal Best-Arm Identification Methods for Tail-Risk Measures , 2020, NeurIPS.

[11]  R. Rockafellar,et al.  Optimization of conditional value-at-risk , 2000 .

[12]  T. L. Lai and Herbert Robbins  Asymptotically Efficient Adaptive Allocation Rules , 2022 .

[13]  C. Berge Topological Spaces: including a treatment of multi-valued functions , 2010 .

[14]  Rémi Munos,et al.  Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis , 2012, ALT.

[15]  C. W. Richardson Wgen: A Model for Generating Daily Weather Variables , 2018 .

[16]  Matthew J. Holland,et al.  Learning with CVaR-based feedback under potentially heavy tails , 2020, ArXiv.

[17]  B. Mandelbrot The Variation of Certain Speculative Prices , 1963 .

[18]  R. Munos,et al.  Kullback–Leibler upper confidence bounds for optimal sequential allocation , 2012, 1210.1136.

[19]  David B. Brown,et al.  Large deviations bounds for estimating conditional value-at-risk , 2007, Oper. Res. Lett.

[20]  P. Massart The Tight Constant in the Dvoretzky-Kiefer-Wolfowitz Inequality , 1990 .

[21]  Krishna Jagannathan,et al.  Distribution oblivious, risk-aware algorithms for multi-armed bandits with unbounded rewards , 2019, NeurIPS.

[22]  Akimichi Takemura,et al.  Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards , 2015, J. Mach. Learn. Res.

[23]  Michal Valko,et al.  Extreme bandits , 2014, NIPS.

[24]  Peter Auer,et al.  Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.

[25]  A. Burnetas,et al.  Optimal Adaptive Policies for Sequential Allocation Problems , 1996 .

[26]  Emma Brunskill,et al.  Distributionally-Aware Exploration for CVaR Bandits , 2019 .

[27]  Junya Honda,et al.  Bandit Algorithms Based on Thompson Sampling for Bounded Reward Distributions , 2020, ALT.

[28]  Qing Zhao,et al.  Mean-variance and value at risk in multi-armed bandit problems , 2015, 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[29]  Csaba Szepesvari,et al.  Bandit Algorithms , 2020 .

[30]  W. R. Thompson ON THE LIKELIHOOD THAT ONE UNKNOWN PROBABILITY EXCEEDS ANOTHER IN VIEW OF THE EVIDENCE OF TWO SAMPLES , 1933 .

[31]  Quirino Paris,et al.  The Return of von Liebig's “Law of the Minimum” , 1992 .

[32]  Qiuyu Zhu,et al.  Thompson Sampling Algorithms for Mean-Variance Bandits , 2020, ICML.

[33]  Shie Mannor,et al.  A General Approach to Multi-Armed Bandits Under Risk Criteria , 2018, COLT.

[34]  Odalric-Ambrym Maillard,et al.  Robust Risk-Averse Stochastic Multi-armed Bandits , 2013, ALT.

[35]  Qing Zhao,et al.  Risk-Averse Multi-Armed Bandit Problems Under Mean-Variance Measure , 2016, IEEE Journal of Selected Topics in Signal Processing.

[36]  Michèle Sebag,et al.  Exploration vs Exploitation vs Safety: Risk-Aware Multi-Armed Bandits , 2013, ACML.

[37]  Shipra Agrawal,et al.  Further Optimal Regret Bounds for Thompson Sampling , 2012, AISTATS.

[38]  Advances in crop modelling for a sustainable agriculture , 2019 .

[39]  Krishna Jagannathan,et al.  Constrained regret minimization for multi-criterion multi-armed bandits , 2020, ArXiv.

[40]  Tor Lattimore,et al.  A Scale Free Algorithm for Stochastic Bandits with Bounded Kurtosis , 2017, NIPS.

[41]  Krishnendu Chatterjee,et al.  Generalized Risk-Aversion in Stochastic Multi-Armed Bandits , 2014, ArXiv.

[42]  Rémi Munos,et al.  Thompson Sampling for 1-Dimensional Exponential Family Bandits , 2013, NIPS.

[43]  Krishna P. Jagannathan,et al.  Concentration bounds for CVaR estimation: The cases of light-tailed and heavy-tailed distributions , 2019, ICML.

[44]  P. Carberry,et al.  Emerging consensus on desirable characteristics of tools to support farmers' management of climate risk in Australia , 2011 .

[45]  Byeong Ho Kang,et al.  From Data to Decisions: Helping Crop Producers Build Their Actionable Knowledge , 2017 .

[46]  Aurélien Garivier,et al.  Explore First, Exploit Next: The True Shape of Regret in Bandit Problems , 2016, Math. Oper. Res.