A Bit Better? Quantifying Information for Bandit Learning

The information ratio offers an approach to assessing how effectively an agent balances exploration and exploitation. It was originally defined as the ratio between the squared expected regret and the mutual information between the environment and the action-observation pair, the latter serving as a measure of information gain. Recent work has inspired consideration of alternative information measures, particularly for use in the analysis of bandit learning algorithms to arrive at tighter regret bounds. We investigate whether quantifying information via such alternatives can improve the realized performance of information-directed sampling, which aims to minimize the information ratio.

Acknowledgement: Financial support from Army Research Office (ARO) grant W911NF2010055 is gratefully acknowledged.
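Rendered in symbols, the original definition described in the abstract takes the following form. This is a minimal sketch; the notation \Gamma_t, \pi, \Delta_t, E, A_t, O_t is illustrative rather than the paper's own:

    \Gamma_t(\pi) \;=\; \frac{\left( \mathbb{E}_{\pi}\!\left[ \Delta_t \right] \right)^2}{I_t\!\left( E;\, (A_t, O_t) \right)}

Here \pi is the distribution from which the action A_t is sampled, \Delta_t is the one-period expected regret, O_t is the resulting observation, E denotes the environment, and I_t is mutual information conditioned on the history through time t. Information-directed sampling selects, in each period, a distribution \pi minimizing \Gamma_t(\pi).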

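For a finite action set, the minimization over \pi admits a simple numerical treatment: a minimizer of the information ratio is known to mix over at most two actions, so a search over action pairs and mixing probabilities suffices. The Python sketch below illustrates this under the assumption that per-action expected-regret estimates (delta) and information-gain estimates (gain) have already been computed, e.g., from posterior samples; the function name and the grid discretization are illustrative, not the paper's implementation.

    import numpy as np

    def ids_action_distribution(delta, gain, grid=1001):
        """Choose an action distribution minimizing the information ratio.

        delta[a]: estimated expected one-period regret of action a.
        gain[a]:  estimated information gain from playing action a.
        Searches distributions supported on at most two actions, a set
        known to contain a minimizer of the information ratio.
        """
        delta = np.asarray(delta, dtype=float)
        gain = np.asarray(gain, dtype=float)
        n = len(delta)
        qs = np.linspace(0.0, 1.0, grid)  # candidate mixing probabilities
        best_ratio, best_pi = np.inf, None
        for i in range(n):
            for j in range(i, n):
                d = qs * delta[i] + (1.0 - qs) * delta[j]  # mixed regret
                g = qs * gain[i] + (1.0 - qs) * gain[j]    # mixed gain
                with np.errstate(divide="ignore", invalid="ignore"):
                    ratio = np.where(g > 0.0, d ** 2 / g, np.inf)
                k = int(np.argmin(ratio))
                if ratio[k] < best_ratio:
                    best_ratio = ratio[k]
                    best_pi = np.zeros(n)
                    best_pi[i] += qs[k]
                    best_pi[j] += 1.0 - qs[k]
        return best_pi, best_ratio

    # Example: action 0 looks near-optimal but uninformative, action 1
    # is worse but informative; IDS mixes between the two.
    pi, ratio = ids_action_distribution(delta=[0.05, 0.30, 0.50],
                                        gain=[0.01, 0.40, 0.05])
    print(pi, ratio)

In a full agent, this routine would be invoked each round with freshly updated estimates; swapping in one of the alternative information measures the paper investigates amounts to changing how gain is computed.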