Human-in-the-Loop Synthesis for Partially Observable Markov Decision Processes

We study planning problems where autonomous agents operate in environments that are subject to uncertainties and only partially observable. Partially observable Markov decision processes (POMDPs) are a natural formal model for such problems. Because of the potentially huge or even infinite belief space of a POMDP, synthesis with safety guarantees is, in general, computationally intractable. We propose an approach that aims to circumvent this difficulty: in scenarios that can be partially or fully simulated in a virtual environment, we actively integrate a human user to control an agent. While the user repeatedly tries to safely guide the agent in the simulation, we collect data from the human input. Via behavior cloning, we translate these data into a strategy for the POMDP. The strategy resolves all nondeterminism and non-observability of the POMDP, resulting in a discrete-time Markov chain (MC). Efficient verification of this MC gives quantitative insight into the quality of the inferred human strategy by proving or disproving given system specifications. If the quality of the strategy is insufficient, we propose a refinement method that presents counterexamples to the human. Experiments show that including humans in the POMDP verification loop improves the state of the art by orders of magnitude in terms of scalability.
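
To make the pipeline concrete, here is a minimal Python sketch, not the authors' implementation: a toy POMDP given as nested dictionaries, an observation-based strategy obtained by frequency counts over hypothetical human demonstrations (a simple form of behavior cloning), the Markov chain induced by applying that strategy, and a fixed-point computation of the probability of reaching a bad state, checked against an assumed safety threshold. The example POMDP, the demonstrations, and the threshold are illustrative assumptions.

```python
from collections import defaultdict

# Toy POMDP: trans[s][a] maps successor states to probabilities; obs[s] is the
# observation emitted in state s. (Illustrative example, not from the paper.)
trans = {
    "s0":   {"left": {"s1": 1.0},               "right": {"s2": 0.5, "bad": 0.5}},
    "s1":   {"left": {"goal": 1.0},             "right": {"bad": 1.0}},
    "s2":   {"left": {"goal": 0.8, "bad": 0.2}, "right": {"bad": 1.0}},
    "goal": {"left": {"goal": 1.0},             "right": {"goal": 1.0}},
    "bad":  {"left": {"bad": 1.0},              "right": {"bad": 1.0}},
}
obs = {"s0": "start", "s1": "corridor", "s2": "corridor", "goal": "goal", "bad": "bad"}

# Hypothetical human demonstrations collected in the simulation, recorded as
# (observation, chosen action) pairs.
demonstrations = [("start", "left"), ("corridor", "left"),
                  ("start", "left"), ("corridor", "left"),
                  ("start", "right")]

# Behavior cloning by frequency estimation: count actions per observation.
counts = defaultdict(lambda: defaultdict(int))
for o, a in demonstrations:
    counts[o][a] += 1

def strategy(o, available):
    """Randomized observation-based strategy; uniform fallback for unseen observations."""
    total = sum(counts[o].values())
    if total == 0:
        return {a: 1.0 / len(available) for a in available}
    return {a: c / total for a, c in counts[o].items()}

# Applying the strategy resolves all nondeterminism and non-observability:
# the result is a discrete-time Markov chain mc[s] = {successor: probability}.
mc = {}
for s, actions in trans.items():
    dist = defaultdict(float)
    for a, p_a in strategy(obs[s], list(actions)).items():
        for t, p_t in actions.get(a, {}).items():
            dist[t] += p_a * p_t
    mc[s] = dict(dist)

# Verify a reachability specification on the induced MC: the probability of
# ever reaching "bad" from the initial state must stay below a threshold.
THRESHOLD = 0.1  # assumed safety bound, for illustration only
prob = {s: (1.0 if s == "bad" else 0.0) for s in mc}
for _ in range(1000):  # fixed-point iteration (value iteration for reachability)
    prob = {s: (1.0 if s == "bad" else sum(p * prob[t] for t, p in mc[s].items()))
            for s in mc}

result = "holds" if prob["s0"] <= THRESHOLD else "violated (refine with the human)"
print(f"Pr(reach bad from s0) = {prob['s0']:.3f} -> specification {result}")
```

In this toy run the cloned strategy reaches the bad state with probability 0.2, so the assumed bound of 0.1 is violated; in the proposed workflow, counterexamples would then be presented to the human to solicit further demonstrations.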
