Planning With Uncertain Specifications (PUnS)

Reward engineering is crucial to high performance in reinforcement learning systems. Prior research into reward design has largely focused on Markovian reward functions. While there has been work on expressing non-Markovian rewards as linear temporal logic (LTL) formulas, it has focused on task specifications defined directly by the user. In many real-world applications, however, task specifications are ambiguous and can only be expressed as a belief over LTL formulas. In this letter, we introduce planning with uncertain specifications (PUnS), a novel formulation that addresses the challenge posed by non-Markovian specifications expressed as beliefs over LTL formulas. We present four criteria that capture the semantics of satisfying a belief over specifications for different applications, and analyze their qualitative implications within a synthetic domain. We demonstrate the existence of an equivalent Markov decision process (MDP) for any instance of PUnS. Finally, we demonstrate our approach on the real-world task of setting a dinner table with a robot that inferred task specifications from human demonstrations.
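
The sketch below illustrates one way a PUnS instance might be represented in code: the belief as a list of (LTL formula, probability) pairs, with a terminal reward equal to the probability mass of formulas satisfied by the executed trace. This is an illustrative sketch only, not the construction used in the letter; the `PUnSInstance` class, the `satisfies` check, and the expected-satisfaction criterion are placeholders introduced here for exposition.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple

Trace = List[Set[str]]   # a trace: sequence of sets of propositions true at each step
Formula = str            # placeholder: an LTL formula in some textual syntax


@dataclass
class PUnSInstance:
    """A belief over LTL specifications: candidate formulas with probabilities."""
    belief: List[Tuple[Formula, float]]  # [(formula, P(formula is the true spec)), ...]

    def expected_satisfaction(
        self,
        trace: Trace,
        satisfies: Callable[[Trace, Formula], bool],
    ) -> float:
        # One possible terminal reward: the probability mass of formulas
        # in the belief that the executed trace satisfies.
        return sum(p for phi, p in self.belief if satisfies(trace, phi))


# Toy usage with a stubbed satisfaction check (purely illustrative):
if __name__ == "__main__":
    def satisfies(trace: Trace, phi: Formula) -> bool:
        # Stub: "F plate" holds if 'plate' appears anywhere in the trace.
        return any("plate" in step for step in trace) if phi == "F plate" else False

    puns = PUnSInstance(belief=[("F plate", 0.7), ("F fork", 0.3)])
    print(puns.expected_satisfaction([{"plate"}, {"cup"}], satisfies))  # 0.7
```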
