Markov Decision Processes with Time-varying Transition Probabilities and Rewards

We consider online Markov decision process (MDP) problems in which both the transition probabilities and the rewards are time-varying or even adversarially generated. We propose an online algorithm based on an online implementation of value iteration and show that its dynamic regret, i.e., the difference between its total reward and that of the optimal sequence of (nonstationary) policies in hindsight, is upper bounded by the total variation of the transition probabilities and the rewards. Moreover, we show that the dynamic regret of any online algorithm is lower bounded by the total variation of the transition probabilities and the rewards, which implies that the proposed algorithm is optimal up to a constant factor. Finally, we test our algorithm on a power management problem for a data center and show that it reduces energy cost while maintaining quality of service (QoS) under real electricity prices and job arrival rates.
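To make the "online implementation of value iteration" concrete, the following is a minimal Python sketch, not the paper's exact algorithm: at each step the learner performs one Bellman backup against the newest transition kernel and reward and acts greedily in between. The function name, the array shapes, and the relative-value normalization (pinning a reference state so values stay bounded, as in relative value iteration for average-reward MDPs) are all our own illustrative assumptions.

```python
import numpy as np

def online_relative_value_iteration(P_seq, r_seq, n_states):
    """Hypothetical one-backup-per-step learner for a time-varying MDP.

    P_seq: iterable of arrays of shape (A, S, S); P_seq[t][a, s, s2] is the
           probability of moving from state s to s2 under action a at time t.
    r_seq: iterable of arrays of shape (S, A); r_seq[t][s, a] is the reward.
    Yields the greedy policy (array of shape (S,)) played at each step.
    """
    V = np.zeros(n_states)                # running (relative) value estimate
    for P_t, r_t in zip(P_seq, r_seq):
        # One Bellman backup against the newest kernel/reward:
        #   Q_t(s, a) = r_t(s, a) + sum_{s2} P_t(a, s, s2) * V(s2)
        Q = r_t + np.einsum('ast,t->sa', P_t, V)
        policy = Q.argmax(axis=1)         # act greedily w.r.t. Q_t
        V = Q.max(axis=1)
        V -= V[0]                         # relative VI: subtract a reference
                                          # state's value so V stays bounded
        yield policy
```

A small usage example on randomly generated time-varying kernels and rewards:

```python
rng = np.random.default_rng(0)
S, A, T = 3, 2, 5
P_seq = rng.dirichlet(np.ones(S), size=(T, A, S))  # shape (T, A, S, S)
r_seq = rng.uniform(size=(T, S, A))
for t, pi in enumerate(online_relative_value_iteration(P_seq, r_seq, S)):
    print(f"t={t}, greedy policy: {pi}")
```

Intuitively, when the kernels and rewards drift slowly (small total variation), each single backup tracks the moving optimal policy closely, which is the mechanism behind the total-variation regret bound stated above.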
