Reward-Machine-Guided, Self-Paced Reinforcement Learning

Self-paced reinforcement learning (RL) aims to improve the data efficiency of learning by automatically creating sequences of probability distributions over contexts, namely curricula. However, existing techniques for self-paced RL fail in long-horizon planning tasks that involve temporally extended behaviors. We hypothesize that taking advantage of prior knowledge about the underlying task structure can improve the effectiveness of self-paced RL. We develop a self-paced RL algorithm guided by reward machines, i.e., a type of finite-state machine that encodes the underlying task structure. The algorithm integrates reward machines in 1) the update of the policy and value functions obtained by any RL algorithm of choice, and 2) the update of the automated curriculum that generates context distributions. Our empirical results show that the proposed algorithm reliably achieves optimal behavior even in cases in which existing baselines cannot make any meaningful progress. It also shortens the curriculum by up to one-fourth and reduces the variance in the curriculum generation process by up to four orders of magnitude.
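To make the notion of a reward machine concrete, the following is a minimal sketch (the class, state names, and event labels are illustrative assumptions, not the paper's implementation): a finite-state machine whose transitions fire on high-level events observed in the environment and emit rewards, thereby encoding temporally extended task structure such as "first reach A, then reach B".

```python
class RewardMachine:
    """A minimal reward machine: a finite-state machine over high-level
    events, where each transition emits a reward."""

    def __init__(self, states, initial_state, transitions):
        # transitions maps (state, event) -> (next_state, reward)
        self.states = states
        self.state = initial_state
        self.transitions = transitions

    def step(self, event):
        """Advance the machine on an observed event and return the reward."""
        if (self.state, event) in self.transitions:
            self.state, reward = self.transitions[(self.state, event)]
            return reward
        # Unmatched events leave the machine state unchanged.
        return 0.0


# Hypothetical two-stage task: "pick up the key, then open the door".
rm = RewardMachine(
    states={"u0", "u1", "u_goal"},
    initial_state="u0",
    transitions={
        ("u0", "key"): ("u1", 0.0),
        ("u1", "door"): ("u_goal", 1.0),
    },
)
rm.step("door")  # 0.0 -- opening the door before the key does nothing
rm.step("key")   # 0.0 -- advances the machine to u1
rm.step("door")  # 1.0 -- temporally extended task completed
```

The machine state acts as a memory of task progress, which is what allows an RL agent (and, in this paper, the curriculum generator) to condition on how much of a temporally extended task has been accomplished.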
