Model-Free Non-Stationarity Detection and Adaptation in Reinforcement Learning

In most Reinforcement Learning (RL) studies, the considered task is assumed to be stationary, i.e., its behavior and characteristics do not change over time, since this assumption underpins the convergence guarantees of RL techniques. Unfortunately, it rarely holds in real-world scenarios, where systems and environments typically evolve over time. For instance, in robotic applications, sensor or actuator faults can induce a sudden change in the RL setting, while in financial applications the evolution of the market can cause a more gradual variation over time. In this paper, we present an adaptive RL algorithm able to detect changes in the environment or in the reward function and to react by adapting to the new conditions of the task. First, we develop a figure of merit on which a hypothesis test can be applied to detect changes between two learning iterations. Then, we extend this test to operate sequentially over time by means of the CUmulative SUM (CUSUM) approach. Finally, the proposed change-detection mechanism is combined (following an adaptive-active approach) with a well-known RL algorithm to enable it to deal with non-stationary tasks. We test the proposed algorithm on two well-known continuous-control tasks to assess its effectiveness in non-stationarity detection and adaptation compared with a vanilla RL algorithm.
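As a rough illustration of the sequential CUSUM idea mentioned above (not the paper's exact figure of merit), the following minimal Python sketch accumulates a per-iteration score and raises an alarm when the cumulative evidence exceeds a threshold; the drift and threshold parameters here are hypothetical and would need tuning per task.

    # Minimal sketch of a one-sided CUSUM change detector (illustrative only).
    import numpy as np

    class CusumDetector:
        def __init__(self, drift=0.0, threshold=5.0):
            self.drift = drift          # allowed slack before accumulating evidence (assumed)
            self.threshold = threshold  # alarm threshold (assumed, tuned per task)
            self.cumsum = 0.0

        def update(self, score):
            """Accumulate evidence from a per-iteration score; return True on alarm."""
            self.cumsum = max(0.0, self.cumsum + score - self.drift)
            if self.cumsum > self.threshold:
                self.cumsum = 0.0       # reset after signalling a change
                return True
            return False

    # Usage sketch: feed the detector a score computed at each learning iteration
    # (e.g., a divergence between statistics of consecutive iterations) and adapt
    # the learner whenever a change is flagged.
    detector = CusumDetector(drift=0.1, threshold=4.0)
    for iteration, score in enumerate(np.random.randn(1000)):
        if detector.update(score):
            print(f"Change detected at iteration {iteration}; adapting the policy.")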
