BelMan: Bayesian Bandits on the Belief-Reward Manifold

We propose a generic Bayesian, information-geometric approach to the exploration--exploitation trade-off in multi-armed bandit problems. Our approach, BelMan, uniformly supports pure exploration, exploration--exploitation, and two-phase bandit problems. Knowledge about the bandit arms and their reward distributions is summarised by the barycentre of the arms' joint belief-reward distributions, the \emph{pseudobelief-reward}, within the belief-reward manifold. BelMan alternates an \emph{information projection} and a \emph{reverse information projection}: the pseudobelief-reward is projected onto the belief-rewards to choose the arm to play, and the resulting belief-rewards are projected back onto the pseudobelief-reward. An exploitative bias is infused by means of a \emph{focal distribution}, i.e., a reward distribution that gradually concentrates on higher rewards. Comparative evaluation against state-of-the-art algorithms shows that BelMan is not only competitive but can also outperform other approaches in specific setups, for instance those involving many arms and continuous rewards.
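The abstract describes the alternating-projection loop only at a high level. As a rough illustration, here is a minimal Python sketch for Bernoulli arms with Beta beliefs. The parameter-averaged pseudobelief, the logarithmic tilt standing in for the focal distribution, the KL direction in the projection step, and all names (`BelManSketch`, `kl_beta`, `a0_focal`) are assumptions of this sketch, not the paper's exact construction.

```python
import numpy as np
from scipy.special import betaln, digamma

def kl_beta(a1, b1, a2, b2):
    """KL(Beta(a1, b1) || Beta(a2, b2)), in closed form."""
    return (betaln(a2, b2) - betaln(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

class BelManSketch:
    """Schematic BelMan-style loop for Bernoulli arms with Beta beliefs.

    Simplifications (assumptions, not the paper's construction): the
    pseudobelief is the parameter average of the arm posteriors, and
    the focal bias is a crude tilt of the pseudobelief towards reward
    1 that strengthens over time.
    """

    def __init__(self, n_arms):
        self.a = np.ones(n_arms)  # Beta alpha parameter per arm
        self.b = np.ones(n_arms)  # Beta beta parameter per arm
        self.t = 0                # number of pulls so far

    def select_arm(self):
        self.t += 1
        # Reverse I-projection, simplified: summarise all arm
        # posteriors by a single Beta "pseudobelief".
        a0, b0 = self.a.mean(), self.b.mean()
        # Focal bias: push the pseudobelief towards high rewards as t
        # grows, so exploration gradually yields to exploitation.
        a0_focal = a0 + np.log(self.t + 1.0)
        # I-projection, one reading of "project the pseudobelief onto
        # the arm beliefs": play the arm whose posterior is closest in
        # KL to the focal pseudobelief. (The KL direction here is an
        # assumption of this sketch.)
        kls = [kl_beta(a0_focal, b0, self.a[k], self.b[k])
               for k in range(len(self.a))]
        return int(np.argmin(kls))

    def update(self, arm, reward):
        # Standard Beta-Bernoulli conjugate posterior update.
        self.a[arm] += reward
        self.b[arm] += 1 - reward

# Hypothetical usage: three Bernoulli arms with unknown means.
rng = np.random.default_rng(0)
means = [0.3, 0.5, 0.7]
agent = BelManSketch(n_arms=3)
for _ in range(500):
    k = agent.select_arm()
    r = int(rng.random() < means[k])
    agent.update(k, r)
print(agent.a / (agent.a + agent.b))  # posterior mean reward per arm
```

In this simplified loop the tilt plays the focal distribution's role: early on the pseudobelief is broad and under-sampled arms stay competitive, while as the tilt grows the projection increasingly favours arms whose posteriors concentrate on high rewards.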
