Joint Inference of Reward Machines and Policies for Reinforcement Learning

Incorporating high-level knowledge is an effective way to expedite reinforcement learning (RL), especially for complex tasks with sparse rewards. We investigate an RL problem where the high-level knowledge takes the form of reward machines, a type of Mealy machine that encodes reward functions. We focus on a setting in which this knowledge is not available to the learning agent a priori. We develop an iterative algorithm that performs joint inference of reward machines and policies for RL (more specifically, q-learning). In each iteration, the algorithm maintains a hypothesis reward machine and a sample of RL episodes. It derives q-functions from the current hypothesis reward machine and performs RL to update them. While performing RL, the algorithm updates the sample by adding episodes along which the obtained rewards are inconsistent with the rewards predicted by the current hypothesis reward machine. In the next iteration, the algorithm infers a new hypothesis reward machine from the updated sample. Based on an equivalence relation we define between states of reward machines, we transfer the q-functions between the hypothesis reward machines of consecutive iterations. We prove that the proposed algorithm converges almost surely to an optimal policy in the limit, provided that a minimal reward machine can be inferred and the maximal length of each RL episode is sufficiently long. The experiments show that learning high-level knowledge in the form of reward machines leads to fast convergence to optimal policies, whereas standard RL methods such as q-learning and hierarchical RL methods fail to converge to optimal policies in many tasks even after a substantial number of training steps.
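
As a rough illustration of the iterative scheme described above, the following Python sketch shows the main loop: q-learning over the product of environment states and reward-machine states, collection of episodes whose observed rewards contradict the current hypothesis, and re-inference of a reward machine from the collected counterexamples. The environment interface (env.reset, env.step, env.actions), the labeling function, the RewardMachine class, and the infer_minimal_rm routine (e.g., a SAT-based passive automaton learner) are illustrative assumptions, not part of the paper's code; the equivalence-based transfer of q-functions between consecutive hypotheses is omitted for brevity.

    import random
    from collections import defaultdict

    class RewardMachine:
        """Mealy-machine encoding of a reward function: a transition map
        delta[(u, label)] -> u' and an output map sigma[(u, label)] -> reward."""
        def __init__(self, initial, delta, sigma):
            self.initial, self.delta, self.sigma = initial, delta, sigma

        def replay(self, labels):
            """Reward sequence this machine assigns to a high-level label sequence."""
            u, rewards = self.initial, []
            for l in labels:
                rewards.append(self.sigma[(u, l)])
                u = self.delta[(u, l)]
            return rewards

    def jirp(env, labeling_fn, infer_minimal_rm, episodes=10000, max_len=200,
             alpha=0.1, gamma=0.9, eps=0.1):
        """Sketch of joint inference of reward machines and policies (assumed interfaces)."""
        # Start from a trivial one-state hypothesis that predicts reward 0 everywhere.
        hyp = RewardMachine(initial=0, delta=defaultdict(int), sigma=defaultdict(float))
        q = defaultdict(float)        # one q-value per (RM state, env state, action)
        counterexamples = []          # traces whose rewards contradict the hypothesis

        for _ in range(episodes):
            s, u = env.reset(), hyp.initial
            labels, rewards = [], []
            for _ in range(max_len):
                # epsilon-greedy w.r.t. the q-function attached to the current RM state
                if random.random() < eps:
                    a = random.choice(env.actions)
                else:
                    a = max(env.actions, key=lambda b: q[(u, s, b)])
                s2, r, done = env.step(a)
                l = labeling_fn(s, a, s2)
                labels.append(l); rewards.append(r)
                u2 = hyp.delta[(u, l)]
                # standard q-learning update on the product state (u, s)
                target = r + (0.0 if done else gamma * max(q[(u2, s2, b)] for b in env.actions))
                q[(u, s, a)] += alpha * (target - q[(u, s, a)])
                s, u = s2, u2
                if done:
                    break
            # Inconsistent episode: store it and re-infer a minimal consistent reward machine.
            if hyp.replay(labels) != rewards:
                counterexamples.append((labels, rewards))
                hyp = infer_minimal_rm(counterexamples)
                # The paper transfers q-functions between equivalent states of the old and
                # new hypotheses; this sketch simply re-initializes them.
                q = defaultdict(float)
        return hyp, q

In this sketch, re-inference happens whenever a counterexample is found; batching counterexamples and re-inferring less frequently is an equally valid design choice under the same scheme.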
