BelMan: Bayesian Bandits on the Belief-Reward Manifold

We propose a generic Bayesian, information-geometric approach to the exploration--exploitation trade-off in multi-armed bandit problems. Our approach, BelMan, uniformly supports pure exploration, exploration--exploitation, and two-phase bandit problems. Knowledge about the bandit arms and their reward distributions is summarised by the barycentre of the arms' joint belief-reward distributions, the \emph{pseudobelief-reward}, within the belief-reward manifold. BelMan alternates an \emph{information projection} and a \emph{reverse information projection}: the pseudobelief-reward is projected onto the belief-rewards to choose the arm to play, and the resulting belief-rewards are projected back onto the pseudobelief-reward. An exploitative bias is infused by means of a \emph{focal distribution}, i.e., a reward distribution that gradually concentrates on higher rewards. Comparative evaluation against state-of-the-art algorithms shows that BelMan is not only competitive but can also outperform other approaches in specific setups, for instance those involving many arms and continuous rewards.
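The abstract describes the alternating-projection loop only at a high level. As a rough illustration, here is a minimal Python sketch for Bernoulli arms with Beta beliefs. The parameter-averaged pseudobelief, the logarithmic tilt standing in for the focal distribution, the KL direction in the projection step, and all names (`BelManSketch`, `kl_beta`, `a0_focal`) are assumptions of this sketch, not the paper's exact construction.

```python
import numpy as np
from scipy.special import betaln, digamma

def kl_beta(a1, b1, a2, b2):
    """KL(Beta(a1, b1) || Beta(a2, b2)), in closed form."""
    return (betaln(a2, b2) - betaln(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

class BelManSketch:
    """Schematic BelMan-style loop for Bernoulli arms with Beta beliefs.

    Simplifications (assumptions, not the paper's construction): the
    pseudobelief is the parameter average of the arm posteriors, and
    the focal bias is a crude tilt of the pseudobelief towards reward
    1 that strengthens over time.
    """

    def __init__(self, n_arms):
        self.a = np.ones(n_arms)  # Beta alpha parameter per arm
        self.b = np.ones(n_arms)  # Beta beta parameter per arm
        self.t = 0                # number of pulls so far

    def select_arm(self):
        self.t += 1
        # Reverse I-projection, simplified: summarise all arm
        # posteriors by a single Beta "pseudobelief".
        a0, b0 = self.a.mean(), self.b.mean()
        # Focal bias: push the pseudobelief towards high rewards as t
        # grows, so exploration gradually yields to exploitation.
        a0_focal = a0 + np.log(self.t + 1.0)
        # I-projection, one reading of "project the pseudobelief onto
        # the arm beliefs": play the arm whose posterior is closest in
        # KL to the focal pseudobelief. (The KL direction here is an
        # assumption of this sketch.)
        kls = [kl_beta(a0_focal, b0, self.a[k], self.b[k])
               for k in range(len(self.a))]
        return int(np.argmin(kls))

    def update(self, arm, reward):
        # Standard Beta-Bernoulli conjugate posterior update.
        self.a[arm] += reward
        self.b[arm] += 1 - reward

# Hypothetical usage: three Bernoulli arms with unknown means.
rng = np.random.default_rng(0)
means = [0.3, 0.5, 0.7]
agent = BelManSketch(n_arms=3)
for _ in range(500):
    k = agent.select_arm()
    r = int(rng.random() < means[k])
    agent.update(k, r)
print(agent.a / (agent.a + agent.b))  # posterior mean reward per arm
```

In this simplified loop the tilt plays the focal distribution's role: early on the pseudobelief is broad and under-sampled arms stay competitive, while as the tilt grows the projection increasingly favours arms whose posteriors concentrate on high rewards.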
