Lifelong Hyper-Policy Optimization with Multiple Importance Sampling Regularization

Learning in a lifelong setting, where the dynamics continually evolve, is a hard challenge for current reinforcement learning algorithms, yet it is a much-needed capability for practical applications. In this paper, we propose an approach that learns a hyper-policy which, taking the current time as input, outputs the parameters of the policy to be queried at that time. The hyper-policy is trained to maximize the estimated future performance, efficiently reusing past data by means of importance sampling, at the cost of introducing a controlled bias. We combine this future performance estimate with the past performance to mitigate catastrophic forgetting. To avoid overfitting to the collected data, we derive a differentiable variance bound that we embed as a penalization term. Finally, we empirically validate our approach, in comparison with state-of-the-art algorithms, on realistic environments, including water resource management and trading.
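To make the idea concrete, below is a minimal sketch (not the authors' implementation) of a time-conditioned Gaussian hyper-policy and an importance-sampling objective with a variance penalty. All names (HyperPolicy, is_objective, lam) are illustrative assumptions; for brevity it uses plain importance sampling against a single behavioral hyper-policy and a crude empirical variance surrogate, whereas the paper employs multiple importance sampling over past hyper-policies and a derived differentiable variance bound.

```python
# Hedged sketch: time-conditioned Gaussian hyper-policy (PGPE-style) and an
# importance-sampling objective with a variance penalty. Illustrative only.
import torch
import torch.nn as nn


class HyperPolicy(nn.Module):
    """Maps a (normalized) time t to a Gaussian over policy parameters theta."""

    def __init__(self, theta_dim: int, hidden: int = 32):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, theta_dim)
        )
        self.log_std = nn.Parameter(torch.zeros(theta_dim))

    def dist(self, t: torch.Tensor) -> torch.distributions.Normal:
        return torch.distributions.Normal(self.mean_net(t), self.log_std.exp())


def is_objective(hyper, times, thetas, returns, behavior_logp, lam=0.1):
    """Importance-sampling performance estimate minus a variance penalization.

    times:          (N, 1) times at which past policy parameters were deployed
    thetas:         (N, theta_dim) policy parameters sampled in the past
    returns:        (N,) empirical returns obtained with those parameters
    behavior_logp:  (N,) log-density of each theta under the behavioral hyper-policy
    lam:            weight of the variance penalty (stand-in for the paper's bound)
    """
    target_logp = hyper.dist(times).log_prob(thetas).sum(-1)
    w = torch.exp(target_logp - behavior_logp)   # importance weights
    j_hat = (w * returns).mean()                 # IS estimate of performance
    var_penalty = (w * returns).var()            # empirical-variance surrogate
    return j_hat - lam * var_penalty
```

In this sketch, maximizing `is_objective` by gradient ascent on the hyper-policy parameters reuses past (time, theta, return) triples while the penalty discourages the large importance weights that would otherwise let the estimator overfit the collected data.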
