BelMan: An Information-Geometric Approach to Stochastic Bandits

We propose a Bayesian information-geometric approach to the exploration–exploitation trade-off in stochastic multi-armed bandits. Uncertainty about reward generation and about the learner's beliefs is represented on the manifold of joint distributions of rewards and beliefs. Accumulated information is summarised by the barycentre of these joint distributions, the pseudobelief-reward. While the pseudobelief-reward facilitates information accumulation through exploration, a second construct, the pseudobelief-focal-reward, is needed to increase exploitation by gradually focusing on higher rewards. The resulting algorithm, BelMan, alternates between projecting the pseudobelief-focal-reward onto the individual belief-reward distributions, which selects the arm to play, and projecting the updated belief-reward distributions back onto the pseudobelief-focal-reward. We prove that BelMan is asymptotically optimal and incurs sublinear regret. We instantiate BelMan for stochastic bandits with Bernoulli and exponential rewards, and for a real-life application: scheduling in queueing bandits. Comparative evaluation against the state of the art shows that BelMan is not only competitive for Bernoulli bandits but in many cases also outperforms other approaches for exponential and queueing bandits.
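The alternating-projection scheme described above can be illustrated with a minimal sketch for Bernoulli bandits with Beta posteriors. Everything below beyond the abstract's high-level description is an assumption made for illustration: the pseudobelief is approximated by a crude parameter average rather than the paper's true barycentre, the focal tilt is a heuristic shift of the pseudobelief's mean via a growing exposure `tau`, and the arm-selection projection is taken to be a KL-divergence minimisation between each arm's Beta belief and the focal pseudobelief. This is a sketch of the control flow, not the authors' exact algorithm.

```python
import math
import random


def digamma(x):
    """Digamma function via recurrence plus an asymptotic series (x > 0)."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))


def kl_beta(a1, b1, a2, b2):
    """Closed-form KL( Beta(a1,b1) || Beta(a2,b2) )."""
    log_beta = lambda a, b: math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (log_beta(a2, b2) - log_beta(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))


def belman_sketch(arm_probs, horizon, seed=0):
    """Illustrative BelMan-style loop for Bernoulli arms (simplified)."""
    rng = random.Random(seed)
    K = len(arm_probs)
    a = [1.0] * K  # Beta(1, 1) prior per arm
    b = [1.0] * K
    pulls = [0] * K
    for t in range(1, horizon + 1):
        # Pseudobelief: parameter average -- a crude stand-in for the barycentre.
        pa, pb = sum(a) / K, sum(b) / K
        # Focal tilt: a growing exposure shifts mass toward higher rewards (heuristic).
        tau = math.log(1.0 + t)
        fa, fb = pa + tau, pb
        # Projection step: play the arm whose belief is KL-closest to the focal pseudobelief.
        k = min(range(K), key=lambda i: kl_beta(a[i], b[i], fa, fb))
        # Observe a Bernoulli reward and update that arm's posterior (reverse step).
        r = 1.0 if rng.random() < arm_probs[k] else 0.0
        a[k] += r
        b[k] += 1.0 - r
        pulls[k] += 1
    return pulls
```

In this sketch, exploitation emerges because a badly performing arm's posterior concentrates at low rewards and drifts away (in KL) from the upward-tilted pseudobelief, while exploration comes from broad posteriors of rarely pulled arms remaining comparatively close to it.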
