Exploration-Exploitation in Constrained MDPs

In many sequential decision-making problems, the goal is to optimize a utility function while satisfying a set of constraints on different utilities. This learning problem is formalized through Constrained Markov Decision Processes (CMDPs). In this paper, we investigate the exploration-exploitation dilemma in CMDPs. While learning in an unknown CMDP, an agent must trade off exploration, to discover new information about the MDP, against exploitation of its current knowledge, to maximize the reward while satisfying the constraints. Although the agent will eventually learn a good or optimal policy, it should not violate the constraints too often during the learning process. In this work, we analyze two approaches for learning in CMDPs. The first leverages the linear-programming formulation of the CMDP to perform optimistic planning at each episode. The second leverages the dual (or saddle-point) formulation of the CMDP to perform incremental, optimistic updates of the primal and dual variables. We show that both approaches achieve sublinear regret w.r.t.\ the main utility while incurring sublinear regret on constraint violations. We also highlight a crucial difference between the two approaches: the linear-programming approach yields stronger guarantees than the dual-formulation-based approach.
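To make the two formulations concrete, here is a minimal sketch of a standard finite-horizon CMDP program over occupancy measures; the notation (horizon $H$, reward $r$, constraint costs $c_i$ with thresholds $\alpha_i$, transitions $p$, initial distribution $\mu$, occupancy measure $q$) is our own illustrative shorthand and need not match the paper's.

% Illustrative notation; the paper's exact formulation may differ.
\begin{align*}
\max_{q \ge 0} \quad & \sum_{h,s,a} q_h(s,a)\, r_h(s,a) \\
\text{s.t.} \quad & \sum_{h,s,a} q_h(s,a)\, c_{i,h}(s,a) \le \alpha_i && \forall i, \\
& \sum_{a} q_h(s,a) = \sum_{s',a'} p_{h-1}(s \mid s',a')\, q_{h-1}(s',a') && \forall s,\ h = 2,\dots,H, \\
& \sum_{a} q_1(s,a) = \mu(s) && \forall s.
\end{align*}
The second approach instead works with the associated Lagrangian saddle point,
\[
\max_{\pi} \min_{\lambda \ge 0} \; V^{\pi}_{r}(\mu) - \sum_i \lambda_i \bigl( V^{\pi}_{c_i}(\mu) - \alpha_i \bigr),
\]
where $V^{\pi}_{r}$ and $V^{\pi}_{c_i}$ denote the expected cumulative reward and $i$-th constraint utility of policy $\pi$, and the primal variable $\pi$ and dual variables $\lambda$ are updated incrementally and optimistically. Roughly, an optimistic variant of the first approach would solve the program above with confidence-based estimates in place of the unknown quantities at each episode, whereas the primal-dual approach avoids repeatedly solving a full program, which is consistent with the weaker constraint-violation guarantees noted in the abstract.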
