Regime Switching Bandits

We study a multi-armed bandit problem in which the rewards exhibit regime switching. Specifically, the distributions of the random rewards generated by all arms depend on a common underlying state that evolves as a finite-state Markov chain. The agent does not observe the underlying state and must learn the unknown transition probability matrix as well as the state-dependent reward distributions. We propose an efficient learning algorithm for this problem, building on spectral method-of-moments estimation for hidden Markov models and upper confidence bound methods for reinforcement learning. We also establish an $O(T^{2/3}\sqrt{\log T})$ bound on the regret of the proposed algorithm, where $T$ is the unknown horizon. Finally, we present numerical experiments that illustrate the effectiveness of the learning algorithm.
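
The sketch below is a minimal illustration of the model described above, not the paper's algorithm: a hidden state follows a finite Markov chain with transition matrix P, and each arm's mean reward depends on the current state. The class name `RegimeSwitchingBandit` and the parameters `P`, `mu`, and `sigma` are illustrative assumptions, and the stationary UCB1 loop at the end is a naive baseline that ignores the hidden dynamics, which is precisely the gap the paper's HMM-estimation-plus-optimism approach is designed to close.

```python
import numpy as np

# Illustrative sketch of a regime-switching bandit environment.
# All names and parameter values here are hypothetical examples.

class RegimeSwitchingBandit:
    def __init__(self, P, mu, sigma=0.1, rng=None):
        self.P = np.asarray(P)      # (S, S) hidden-state transition matrix
        self.mu = np.asarray(mu)    # (S, K) mean reward of arm k in state s
        self.sigma = sigma          # reward noise level
        self.rng = rng or np.random.default_rng()
        self.state = 0              # hidden state, never revealed to the agent

    def pull(self, arm):
        """Advance the hidden chain one step and return a noisy reward."""
        self.state = self.rng.choice(len(self.P), p=self.P[self.state])
        return self.mu[self.state, arm] + self.sigma * self.rng.normal()

# Two regimes whose arm ranking flips: arm 0 is best in state 0,
# arm 2 is best in state 1.
env = RegimeSwitchingBandit(P=[[0.95, 0.05], [0.10, 0.90]],
                            mu=[[1.0, 0.5, 0.2], [0.2, 0.4, 1.0]])

# Naive stationary UCB1 baseline: treats rewards as i.i.d. per arm,
# so it can only learn each arm's long-run average under the chain's
# stationary distribution.
K, T = 3, 5000
counts, sums = np.zeros(K), np.zeros(K)
for t in range(T):
    if t < K:
        arm = t                     # pull each arm once to initialize
    else:
        ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
        arm = int(np.argmax(ucb))
    reward = env.pull(arm)
    counts[arm] += 1
    sums[arm] += reward
```

Because the hidden state is shared across arms, every pull carries information about the regime; the learning algorithm in the paper exploits this through spectral estimation of the hidden Markov model, whereas the baseline above discards it.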
