A Comprehensive Survey on Safe Reinforcement Learning

Safe Reinforcement Learning can be defined as the process of learning policies that maximize the expectation of the return in problems in which it is important to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes. We categorize and analyze two approaches to Safe Reinforcement Learning. The first is based on modifying the optimality criterion (the classic discounted finite/infinite-horizon return) with a safety factor. The second is based on modifying the exploration process, either by incorporating external knowledge or by guiding exploration with a risk metric. We use the proposed classification to survey the existing literature and to suggest future directions for Safe Reinforcement Learning.
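Since the abstract names the two families only at a high level, a minimal sketch may help fix ideas. The Python toy below is an illustration under stated assumptions, not the survey's method: the one-state MDP, the weighting parameter `beta`, and all helper names (`risk_adjusted`, `select_action`, `toy_step`) are hypothetical. It contrasts (1) a transformed optimality criterion that mixes the risk-neutral value with a pessimistic value estimate, and (2) exploration guided by that risk-adjusted value rather than by expected return alone.

```python
# Toy contrast of the two Safe RL families from the abstract, via tabular
# Q-learning. The one-state MDP, `beta`, and all helper names are
# illustrative assumptions, not the survey's notation.
import random
from collections import defaultdict

ACTIONS = [0, 1]          # toy action set
beta = 0.5                # risk-aversion weight (assumed hyperparameter)
alpha, gamma = 0.1, 0.95  # learning rate, discount factor

Q = defaultdict(float)      # risk-neutral value estimate of E[return]
Q_min = defaultdict(float)  # smoothed pessimistic value estimate

def risk_adjusted(s, a):
    # Family 1: transformed optimality criterion -- a convex mix of the
    # risk-neutral value and a pessimistic value instead of plain E[return].
    return (1 - beta) * Q[(s, a)] + beta * Q_min[(s, a)]

def select_action(s, eps=0.1):
    # Family 2: risk-directed exploration -- the (epsilon-)greedy choice is
    # guided by the risk-adjusted values, so risky actions are tried less.
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: risk_adjusted(s, a))

def update(s, a, r, s2):
    # Standard TD update for the risk-neutral estimate ...
    target = r + gamma * max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    # ... and a TD update toward a pessimistic (min over next actions) target.
    worst = r + gamma * min(Q_min[(s2, b)] for b in ACTIONS)
    Q_min[(s, a)] += alpha * (worst - Q_min[(s, a)])

def toy_step(s, a):
    # Hypothetical dynamics: action 1 has the higher mean reward (1.7) but
    # is occasionally catastrophic; action 0 is safe with mean 1.0.
    if a == 1:
        return s, (-10.0 if random.random() < 0.1 else 3.0)
    return s, 1.0

for _ in range(500):          # short training run
    s = 0
    for _ in range(20):
        a = select_action(s)
        s2, r = toy_step(s, a)
        update(s, a, r, s2)
        s = s2
```

In the survey's terms, `risk_adjusted` stands in for the modified optimality criterion and `select_action` for risk-guided exploration; concrete instances from the literature replace these placeholders with, for example, exponential-utility or variance-penalized objectives, worst-case (minimax) criteria, or exploration shaped by external knowledge such as teacher advice and demonstrations.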
