Optimizing for the Future in Non-Stationary MDPs

Most reinforcement learning methods are based upon the key assumption that the transition dynamics and reward functions are fixed, that is, the underlying Markov decision process is stationary. However, in many real-world applications this assumption is violated, and policies learned by existing algorithms may lag behind the changing environment. To proactively search for a good future policy, we present a policy gradient algorithm that maximizes a forecast of future performance. This forecast is obtained by fitting a curve to counterfactual estimates of policy performance over time, without explicitly modeling the underlying non-stationarity. The resulting algorithm amounts to a non-uniform reweighting of past data, and we observe that minimizing performance on some of the data from past episodes can be beneficial when searching for a policy that maximizes future performance. We show that our algorithm, called Prognosticator, is more robust to non-stationarity than two online adaptation techniques on three simulated problems motivated by real-world applications.
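To make the forecasting idea concrete, the sketch below shows one way to fit a curve to past per-episode performance estimates and extrapolate it: the forecast at a future episode reduces to a fixed, non-uniform set of weights on the past estimates, and whenever the forecast point lies beyond the last observed episode some of those weights must be negative, which is the sense in which "minimizing performance on some past data" can raise the forecast. The function name `forecast_weights`, the polynomial basis, its degree, and the ordinary-least-squares fit are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

def forecast_weights(k, delta, degree=2):
    """Weights w over episodes 1..k such that w @ J_hat equals the
    least-squares polynomial forecast of performance at episode k + delta,
    where J_hat holds counterfactual (e.g., importance-sampling) estimates
    of the current policy's performance in each past episode.

    A minimal sketch under assumed choices (polynomial basis, OLS fit);
    it is not the paper's exact construction.
    """
    t = np.arange(1, k + 1, dtype=float)                       # past episode indices
    Phi = np.vander(t, degree + 1, increasing=True)            # (k, degree+1) design matrix
    phi_future = np.vander(np.array([float(k + delta)]),
                           degree + 1, increasing=True)        # (1, degree+1) future features
    # Forecast = phi_future @ (Phi^T Phi)^{-1} Phi^T @ J_hat = w @ J_hat
    w = phi_future @ np.linalg.pinv(Phi)                       # (1, k) row of weights
    return w.ravel()

# Example: 10 past episodes, forecast 3 episodes ahead.
w = forecast_weights(k=10, delta=3)
print(np.round(w, 3))            # non-uniform weights; some are negative
print(round(float(w.sum()), 3))  # weights sum to 1 (constant term is in the basis)
```

Because the constant and linear monomials are in the basis, the weights sum to one and their weighted average of the episode indices equals the future index k + delta; since that exceeds every past index, the weights cannot all be non-negative, so any extrapolating forecast of this form assigns negative weight to some past episodes.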
