Assured Reinforcement Learning with Formally Verified Abstract Policies

We present a new reinforcement learning (RL) approach that enables an autonomous agent to solve decision-making problems under constraints. Our assured reinforcement learning approach models the uncertain environment as a high-level, abstract Markov decision process (AMDP), and uses probabilistic model checking to establish AMDP policies that satisfy a set of constraints defined in probabilistic temporal logic. These formally verified abstract policies are then used to restrict the RL agent’s exploration of the solution space so as to avoid constraint violations. We validate our RL approach by using it to develop autonomous agents for a flag-collection navigation task and an assisted-living planning problem.
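
As a concrete illustration of how a verified abstract policy can restrict exploration, the sketch below gates a standard ε-greedy Q-learning agent so that it only ever selects actions the abstract policy permits. This is a minimal sketch under assumed details, not the implementation evaluated in the paper: the toy gridworld, the abstraction function, and names such as `abstract_policy` and `allowed_actions` are invented for the example, and the PCTL constraint in the comments stands in for whatever constraints a real application would verify.

```python
import random
from collections import defaultdict

# Hypothetical verified abstract policy, assumed to have been synthesised
# offline with a probabilistic model checker (e.g. PRISM) against a PCTL
# constraint such as P>=0.9 [ F goal ]. It maps each abstract state to the
# set of actions it permits; all other actions are forbidden during learning.
abstract_policy = {
    "corridor": {"left", "right", "up"},              # "down" ruled out here
    "room":     {"left", "right", "up", "down"},
}

def state_to_abstract(state):
    """Illustrative abstraction: row 0 of the grid is the 'corridor'."""
    _, y = state
    return "corridor" if y == 0 else "room"

def allowed_actions(state):
    """Exploration is restricted to actions the abstract policy permits."""
    return sorted(abstract_policy[state_to_abstract(state)])

def epsilon_greedy(Q, state, epsilon=0.1):
    actions = allowed_actions(state)
    if random.random() < epsilon:
        return random.choice(actions)                 # explore, but only safely
    return max(actions, key=lambda a: Q[(state, a)])  # exploit

def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.95):
    """Standard Q-learning backup, with the max taken over permitted actions."""
    best_next = max(Q[(s_next, a2)] for a2 in allowed_actions(s_next))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def step(state, action):
    """Toy 3x2 gridworld: start at (0, 0), unit reward at goal (2, 1)."""
    dx, dy = {"left": (-1, 0), "right": (1, 0),
              "up": (0, 1), "down": (0, -1)}[action]
    nx = min(max(state[0] + dx, 0), 2)
    ny = min(max(state[1] + dy, 0), 1)
    return (nx, ny), (1.0 if (nx, ny) == (2, 1) else 0.0)

Q = defaultdict(float)
for episode in range(200):
    s = (0, 0)
    for _ in range(20):
        a = epsilon_greedy(Q, s)
        s_next, r = step(s, a)
        q_update(Q, s, a, r, s_next)
        s = s_next
        if r > 0:                                     # episode ends at the goal
            break
```

Because the Q-learning backup also maximises only over permitted actions, the learned values remain consistent with the restricted action sets, so exploitation never steers the agent towards an action the verified abstract policy forbids.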
