Safe Policy Optimization with Local Generalized Linear Function Approximations

Safe exploration is key to applying reinforcement learning (RL) in safety-critical systems. Existing safe exploration methods guarantee safety under regularity assumptions, which makes them difficult to apply to large-scale real-world problems. We propose a novel algorithm, SPO-LF, that optimizes an agent's policy while learning the relation between locally available features obtained by sensors and the environmental reward/safety using generalized linear function approximations. We provide theoretical guarantees on its safety and optimality. We experimentally show that our algorithm is 1) more efficient in terms of sample complexity and computational cost, 2) more applicable to large-scale problems than previous safe RL methods with theoretical guarantees, and 3) comparably sample-efficient and safer than existing advanced deep RL methods with safety constraints.
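To make the core idea concrete, the following is a minimal, illustrative sketch (not the paper's exact SPO-LF algorithm): a logistic generalized linear model is fit to map locally observed feature vectors to a safety signal, and a pessimistic (lower-confidence) safety estimate is formed before an action is allowed. The logistic link, the L2-regularized maximum-likelihood fit, and the ellipsoidal confidence width beta * ||x||_{V^{-1}} are assumptions chosen here for illustration, in the style of generalized linear bandit confidence sets.

    # Illustrative sketch only: GLM-based safety estimation from local features.
    # All modeling choices below (logistic link, ridge penalty, confidence width)
    # are assumptions for exposition, not the authors' exact method.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_glm(X, y, lam=1.0, lr=0.1, iters=500):
        """L2-regularized maximum-likelihood fit of a logistic GLM, y in {0,1}."""
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            grad = X.T @ (y - sigmoid(X @ theta)) - lam * theta
            theta += lr * grad / len(y)
        return theta

    def pessimistic_safety(x, theta, V_inv, beta):
        """Lower-confidence safety estimate for a locally observed feature x."""
        width = beta * np.sqrt(x @ V_inv @ x)   # ellipsoidal confidence width
        return sigmoid(x @ theta - width)

    # Toy usage: features observed by sensors, binary safety labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    true_theta = np.array([1.5, -1.0, 0.5])
    y = (rng.random(200) < sigmoid(X @ true_theta)).astype(float)

    theta_hat = fit_glm(X, y)
    V_inv = np.linalg.inv(X.T @ X + np.eye(3))
    x_new = np.array([0.2, -0.3, 1.0])
    print(pessimistic_safety(x_new, theta_hat, V_inv, beta=1.0))

In this sketch, actions whose pessimistic safety estimate exceeds a threshold would be treated as certifiably safe to explore, while the fitted GLM is refined as more sensor data arrive.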
