Robust Anytime Learning of Markov Decision Processes

Markov decision processes (MDPs) are formal models commonly used in sequential decision-making. MDPs capture the stochasticity that may arise, for instance, from imprecise actuators via probabilities in the transition function. However, in data-driven applications, deriving precise probabilities from (limited) data introduces statistical errors that may lead to unexpected or undesirable outcomes. Uncertain MDPs (uMDPs) do not require precise probabilities but instead use so-called uncertainty sets in the transitions, accounting for such limited data. Tools from the formal verification community efficiently compute robust policies that provably adhere to formal specifications, like safety constraints, under the worst-case instance in the uncertainty set. We continuously learn the transition probabilities of an MDP in a robust anytime-learning approach that combines a dedicated Bayesian inference scheme with the computation of robust policies. In particular, our method (1) approximates probabilities as intervals, (2) adapts to new data that may be inconsistent with an intermediate model, and (3) may be stopped at any time to compute a robust policy on the uMDP that faithfully captures the data so far. Furthermore, our method is capable of adapting to changes in the environment. We show the effectiveness of our approach and compare it to robust policies computed on uMDPs learned by the UCRL2 reinforcement learning algorithm in an experimental evaluation on several benchmarks.
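
The abstract describes maintaining a Bayesian estimate of the transition probabilities and, at any time, extracting probability intervals that yield a uMDP for robust policy computation. The sketch below is an illustrative assumption only, not the paper's dedicated inference scheme: it keeps Dirichlet counts per (state, action) pair and derives per-successor intervals from the marginal Beta posteriors. The class name `IntervalTransitionEstimator`, the symmetric prior strength, and the choice of credible-interval mass are all hypothetical parameters introduced here for illustration.

```python
# Illustrative sketch only: a minimal anytime interval estimator for MDP
# transition probabilities. The prior strength and the use of per-successor
# Beta credible intervals are assumptions; the paper's dedicated Bayesian
# scheme (including its handling of inconsistent data) is not reproduced.
from collections import defaultdict
from scipy.stats import beta


class IntervalTransitionEstimator:
    """Keeps Dirichlet counts per (state, action) and returns probability
    intervals from the marginal Beta posteriors of each successor state."""

    def __init__(self, prior_strength=1.0, credibility=0.95):
        self.prior = prior_strength   # symmetric Dirichlet prior per successor
        self.cred = credibility       # mass of each marginal credible interval
        self.counts = defaultdict(lambda: defaultdict(float))

    def observe(self, state, action, successor):
        """Record one observed transition; may be called at any time."""
        self.counts[(state, action)][successor] += 1.0

    def intervals(self, state, action, successors):
        """Return {successor: (lower, upper)} probability intervals."""
        c = self.counts[(state, action)]
        alphas = {s: self.prior + c.get(s, 0.0) for s in successors}
        total = sum(alphas.values())
        lo_q = (1.0 - self.cred) / 2.0
        hi_q = 1.0 - lo_q
        out = {}
        for s, a in alphas.items():
            # The marginal of a Dirichlet component is Beta(a, total - a).
            out[s] = (beta.ppf(lo_q, a, total - a),
                      beta.ppf(hi_q, a, total - a))
        return out


# Example: stream a few observations, then read off the current intervals,
# which would define the uncertainty sets of an interval MDP (uMDP).
est = IntervalTransitionEstimator()
for succ in ["s1", "s1", "s2"]:
    est.observe("s0", "a", succ)
print(est.intervals("s0", "a", ["s1", "s2"]))
```

Because the counts can be updated after every observation and intervals can be read out at any point, such an estimator matches the anytime flavor described above; a robust verification tool would then compute a worst-case optimal policy on the resulting interval MDP.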
