BelMan: An Information-Geometric Approach to Stochastic Bandits

We propose a Bayesian information-geometric approach to the exploration–exploitation trade-off in stochastic multi-armed bandits. Uncertainty about reward generation and about the learner's beliefs is represented on the manifold of joint distributions of rewards and beliefs. Accumulated information is summarised by the barycentre of these joint distributions, the pseudobelief-reward. While the pseudobelief-reward facilitates information accumulation through exploration, a second construct, the pseudobelief-focal-reward, is needed to increase exploitation by gradually focusing on higher rewards. The resulting algorithm, BelMan, alternates between projecting the pseudobelief-focal-reward onto the individual belief-reward distributions, which selects the arm to play, and projecting the updated belief-reward distributions back onto the pseudobelief-focal-reward. We prove that BelMan is asymptotically optimal and incurs sublinear regret. We instantiate BelMan for stochastic bandits with Bernoulli and exponential rewards, and for a real-life application: scheduling in queueing bandits. Comparative evaluation against the state of the art shows that BelMan is not only competitive for Bernoulli bandits but in many cases also outperforms other approaches for exponential and queueing bandits.
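The alternating-projection scheme described above can be illustrated with a minimal sketch for Bernoulli bandits with Beta posteriors. Everything below beyond the abstract's high-level description is an assumption made for illustration: the pseudobelief is approximated by a crude parameter average rather than the paper's true barycentre, the focal tilt is a heuristic shift of the pseudobelief's mean via a growing exposure `tau`, and the arm-selection projection is taken to be a KL-divergence minimisation between each arm's Beta belief and the focal pseudobelief. This is a sketch of the control flow, not the authors' exact algorithm.

```python
import math
import random


def digamma(x):
    """Digamma function via recurrence plus an asymptotic series (x > 0)."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))


def kl_beta(a1, b1, a2, b2):
    """Closed-form KL( Beta(a1,b1) || Beta(a2,b2) )."""
    log_beta = lambda a, b: math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (log_beta(a2, b2) - log_beta(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))


def belman_sketch(arm_probs, horizon, seed=0):
    """Illustrative BelMan-style loop for Bernoulli arms (simplified)."""
    rng = random.Random(seed)
    K = len(arm_probs)
    a = [1.0] * K  # Beta(1, 1) prior per arm
    b = [1.0] * K
    pulls = [0] * K
    for t in range(1, horizon + 1):
        # Pseudobelief: parameter average -- a crude stand-in for the barycentre.
        pa, pb = sum(a) / K, sum(b) / K
        # Focal tilt: a growing exposure shifts mass toward higher rewards (heuristic).
        tau = math.log(1.0 + t)
        fa, fb = pa + tau, pb
        # Projection step: play the arm whose belief is KL-closest to the focal pseudobelief.
        k = min(range(K), key=lambda i: kl_beta(a[i], b[i], fa, fb))
        # Observe a Bernoulli reward and update that arm's posterior (reverse step).
        r = 1.0 if rng.random() < arm_probs[k] else 0.0
        a[k] += r
        b[k] += 1.0 - r
        pulls[k] += 1
    return pulls
```

In this sketch, exploitation emerges because a badly performing arm's posterior concentrates at low rewards and drifts away (in KL) from the upward-tilted pseudobelief, while exploration comes from broad posteriors of rarely pulled arms remaining comparatively close to it.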
