Probably Approximately Correct MDP Learning and Control With Temporal Logic Constraints

We consider the synthesis of control policies that maximize the probability of satisfying given temporal logic specifications in unknown, stochastic environments. We model the interaction between the system and its environment as a Markov decision process (MDP) with initially unknown transition probabilities. The solution we develop builds on the so-called model-based probably approximately correct Markov decision process (PAC-MDP) methodology. The algorithm attains an $\varepsilon$-approximately optimal policy with probability $1-\delta$ using samples (i.e., observations), time, and space that grow polynomially in the size of the MDP, the size of the automaton expressing the temporal logic specification, $\frac{1}{\varepsilon}$, $\frac{1}{\delta}$, and a finite time horizon. In this approach, the system maintains a model of the initially unknown MDP and constructs a product MDP from its learned model and the specification automaton that expresses the temporal logic constraints. During execution, the policy is iteratively updated using observations of the transitions taken by the system, and the iteration terminates after finitely many steps. With high probability, the resulting policy is such that, for any state, the difference between the probability of satisfying the specification under this policy and under an optimal policy is within a predefined bound.
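For intuition, the following is a minimal Python sketch of the model-based PAC-MDP loop outlined above, written in the style of R-MAX-like optimistic exploration. It is a sketch under stated assumptions, not the paper's exact algorithm: the class name PACProductMDP, the m_known threshold, and the optimistic value of 1.0 for under-sampled pairs are illustrative choices, and product states are assumed to pair MDP states with automaton states under a reachability-style acceptance condition.

```python
# Minimal sketch of a model-based PAC-MDP loop on a product MDP
# (learned MDP x specification automaton). Names and constants are
# illustrative assumptions, not the paper's exact algorithm.
from collections import defaultdict

class PACProductMDP:
    def __init__(self, states, actions, accepting, horizon, m_known):
        self.S = list(states)      # product states: (MDP state, automaton state)
        self.A = list(actions)
        self.acc = set(accepting)  # accepting product states (spec satisfied)
        self.H = horizon           # finite planning horizon
        self.m = m_known           # samples before (s, a) is treated as "known"
        self.count = defaultdict(int)                             # n(s, a)
        self.next_counts = defaultdict(lambda: defaultdict(int))  # n(s, a, s')

    def record(self, s, a, s_next):
        """Update the learned model with one observed transition."""
        self.count[(s, a)] += 1
        self.next_counts[(s, a)][s_next] += 1

    def known(self, s, a):
        return self.count[(s, a)] >= self.m

    def plan(self):
        """Finite-horizon value iteration on the optimistic learned model.

        Unknown (s, a) pairs receive the maximal value 1.0, which drives
        the policy toward under-sampled regions (R-MAX-style optimism).
        """
        V = {s: 1.0 if s in self.acc else 0.0 for s in self.S}
        policy = {}
        for _ in range(self.H):
            V_new = {}
            for s in self.S:
                if s in self.acc:          # accepting states are absorbing
                    V_new[s] = 1.0
                    continue
                best_a, best_v = None, -1.0
                for a in self.A:
                    if self.known(s, a):
                        # Empirical estimate of the satisfaction probability.
                        n = self.count[(s, a)]
                        v = sum(c / n * V[s2]
                                for s2, c in self.next_counts[(s, a)].items())
                    else:
                        v = 1.0            # optimistic value for unknown pairs
                    if v > best_v:
                        best_a, best_v = a, v
                policy[s] = best_a
                V_new[s] = best_v
            V = V_new
        return policy, V
```

A run would alternate plan() with execution: the system follows the returned policy, feeds each observed transition to record(), and replans whenever a state-action pair crosses the m_known threshold. The PAC guarantee then rests on choosing m_known and the horizon as polynomial functions of the problem size, $\frac{1}{\varepsilon}$, and $\frac{1}{\delta}$, as stated in the abstract.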
