Continuous-in-time Limit for Bayesian Bandits

This paper revisits the bandit problem in the Bayesian setting. The Bayesian approach formulates the bandit problem as an optimization problem, where the goal is to find the optimal policy that minimizes the Bayesian regret. One of the main challenges facing the Bayesian approach is that computing the optimal policy is often intractable, especially when the problem horizon or the number of arms is large. In this paper, we first show that, under a suitable rescaling, the Bayesian bandit problem converges to a continuous-time limit governed by a Hamilton-Jacobi-Bellman (HJB) equation. The optimal policy for the limiting HJB equation can be obtained explicitly for several common bandit problems, and we give numerical methods for solving the HJB equation when an explicit solution is not available. Based on these results, we propose an approximate Bayes-optimal policy for Bayesian bandit problems with large horizons. Our method has the added benefit that its computational cost does not increase as the horizon grows.
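To make the intractability concrete, below is a minimal sketch (our own illustration, not the paper's HJB-based method) of exact Bayes-optimal backward induction for a two-armed Bernoulli bandit with independent Beta(1, 1) priors. The two-armed setup, the uniform priors, the horizon of 20 pulls, and the total-reward objective are assumptions made only for this example.

```python
# Illustration (not the paper's method): exact Bayes-optimal backward induction
# for a two-armed Bernoulli bandit with independent Beta(1, 1) priors.
# The belief state is the tuple of success/failure counts for each arm.
from functools import lru_cache

HORIZON = 20  # exact backward induction is already costly well before "large" horizons


def posterior_mean(s, f):
    """Posterior mean of a Bernoulli arm under a Beta(1, 1) prior, after s successes and f failures."""
    return (1 + s) / (2 + s + f)


@lru_cache(maxsize=None)
def value(s1, f1, s2, f2, steps_left):
    """Bayes-optimal expected reward over the remaining pulls, given the Beta posterior counts."""
    if steps_left == 0:
        return 0.0
    return max(q_value(arm, s1, f1, s2, f2, steps_left) for arm in (1, 2))


def q_value(arm, s1, f1, s2, f2, steps_left):
    """Expected reward of pulling `arm` now and then continuing Bayes-optimally."""
    if arm == 1:
        p = posterior_mean(s1, f1)
        win = value(s1 + 1, f1, s2, f2, steps_left - 1)
        lose = value(s1, f1 + 1, s2, f2, steps_left - 1)
    else:
        p = posterior_mean(s2, f2)
        win = value(s1, f1, s2 + 1, f2, steps_left - 1)
        lose = value(s1, f1, s2, f2 + 1, steps_left - 1)
    return p * (1.0 + win) + (1 - p) * lose


if __name__ == "__main__":
    # The number of reachable posterior states grows roughly like O(n^4) in the
    # horizon n, which is what makes exact Bayes-optimal planning impractical
    # for long horizons.
    print("Bayes-optimal expected reward over", HORIZON, "pulls:",
          round(value(0, 0, 0, 0, HORIZON), 4))
    first = max((1, 2), key=lambda a: q_value(a, 0, 0, 0, 0, HORIZON))
    print("Bayes-optimal first pull:", first)
```

In this exact formulation, the table of posterior states that backward induction must visit grows with the horizon, whereas the policy obtained from the limiting HJB equation, as described in the abstract, has a computational cost that does not increase with the horizon.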
