Learning Reward Machines for Partially Observable Reinforcement Learning

Reward Machines (RMs), originally proposed for specifying problems in Reinforcement Learning (RL), provide a structured, automata-based representation of a reward function that allows an agent to decompose problems into subproblems that can be efficiently learned using off-policy learning. Here we show that RMs can be learned from experience, instead of being specified by the user, and that the resulting problem decomposition can be used to effectively solve partially observable RL problems. We pose the task of learning RMs as a discrete optimization problem where the objective is to find an RM that decomposes the problem into a set of subproblems such that the combination of their optimal memoryless policies is an optimal policy for the original problem. We show the effectiveness of this approach on three partially observable domains, where it significantly outperforms A3C, PPO, and ACER, and discuss its advantages, limitations, and broader potential.
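
To make the decomposition described above concrete, below is a minimal sketch of a Reward Machine as an automaton over high-level events, together with tabular Q-learning of one memoryless (observation-conditioned) policy per RM state. This is an illustration under assumptions, not the paper's implementation: the `RewardMachine` class, the toy coffee/office task, the `env` and `labeler` interfaces, and `q_learning_with_rm` are all names invented for this example.

```python
# Illustrative sketch of a Reward Machine (RM) and the kind of problem
# decomposition described in the abstract. All names, the toy task, and the
# minimal env/labeler interfaces are assumptions made for this example.

import random
from collections import defaultdict


class RewardMachine:
    """A finite-state machine over high-level propositions (labels).

    transitions[(u, label)] -> (next_u, reward). Unlisted (u, label) pairs
    leave the RM state unchanged and yield zero reward.
    """

    def __init__(self, initial_state, transitions, terminal_states=()):
        self.initial_state = initial_state
        self.transitions = transitions
        self.terminal_states = set(terminal_states)

    def step(self, u, label):
        return self.transitions.get((u, label), (u, 0.0))

    def is_terminal(self, u):
        return u in self.terminal_states


# Toy RM for a "get coffee, then deliver it to the office" style task:
# u0 --coffee--> u1 --office--> u_acc (reward 1)
rm = RewardMachine(
    initial_state="u0",
    transitions={
        ("u0", "coffee"): ("u1", 0.0),
        ("u1", "office"): ("u_acc", 1.0),
    },
    terminal_states=["u_acc"],
)


def q_learning_with_rm(env, labeler, rm, actions, episodes=500,
                       alpha=0.1, gamma=0.9, eps=0.1):
    """Learn one memoryless Q-function per RM state (off-policy, tabular).

    Assumed minimal interfaces: env.reset() returns a hashable observation,
    env.step(a) returns (next_obs, done), and labeler(obs) maps an
    observation to a high-level label (or None). Conditioning on the pair
    (RM state, observation) instead of the observation alone is what lets
    each subproblem be treated as Markovian in this sketch.
    """
    Q = defaultdict(lambda: defaultdict(float))  # Q[u][(obs, a)]
    for _ in range(episodes):
        obs, u = env.reset(), rm.initial_state
        while not rm.is_terminal(u):
            # Epsilon-greedy action selection with the current RM state's policy.
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[u][(obs, a_)])
            next_obs, done = env.step(a)
            # The RM tracks high-level events and emits the reward.
            u_next, r = rm.step(u, labeler(next_obs))
            target = r if rm.is_terminal(u_next) or done else (
                r + gamma * max(Q[u_next][(next_obs, a_)] for a_ in actions))
            Q[u][(obs, a)] += alpha * (target - Q[u][(obs, a)])
            obs, u = next_obs, u_next
            if done:
                break
    return Q
```

Note that in the approach summarized above the RM is not hand-specified: finding it is itself the discrete optimization problem, and a sketch like this one only shows how a candidate RM, once found, would be exploited by off-policy learning of per-state memoryless policies.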
