Safe Q-Learning Method Based on Constrained Markov Decision Processes

The growing use of reinforcement learning in industrial settings has made agent safety an active research topic. Traditional approaches address safety mainly by modifying the objective function or the agent's exploration process. Because most of these methods ignore the damage caused by unsafe states, however, they can rarely keep the agent out of dangerous states, and the resulting solutions are often unsatisfactory. To address this problem, we propose a safe Q-learning method based on constrained Markov decision processes, in which safety constraints are added to the model as prerequisites. The method extends standard Q-learning so that the algorithm seeks the optimal solution subject to the safety constraints being satisfied. While solving for the optimal state-action value, constraints imposed on the action space filter the agent's feasible set, restricting it to the safe region and thereby guaranteeing safety. Traditional solution methods are not directly applicable to the safe Q-learning model because they tend to converge to local optima; we therefore linearize the constraint functions and apply the Lagrange multiplier method to compute the optimal action that can be performed in the current state. This not only improves the efficiency and accuracy of the algorithm but also guarantees a globally optimal solution. Experiments verify the effectiveness of the algorithm.
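
The abstract gives no pseudocode, so the sketch below only illustrates the two mechanisms it describes: filtering the action space with a safety constraint so that the greedy update ranges over safe actions only, and maintaining a Lagrange multiplier on the constraint when selecting actions. Everything concrete in it (the toy grid environment, the safety_cost function, the budget d, and all hyperparameters) is an assumption made for illustration, not the paper's actual formulation.

```python
# Minimal sketch of constraint-filtered tabular Q-learning with a Lagrange multiplier.
# The environment, cost function, and budget below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 25, 4          # toy 5x5 grid; actions: up, down, left, right
d = 0.5                              # assumed safety budget: c(s, a) must not exceed d
alpha, gamma, eps = 0.1, 0.95, 0.1   # Q-learning step size, discount, exploration rate
lam, lam_lr = 0.0, 0.01              # Lagrange multiplier and its dual step size

Q = np.zeros((n_states, n_actions))

def safety_cost(s, a):
    """Hypothetical per-step safety cost c(s, a); higher means more dangerous."""
    return 1.0 if (s % 5 == 4 and a == 3) else 0.0   # e.g. stepping off the right edge

def env_step(s, a):
    """Toy deterministic transition and reward; stands in for the real environment."""
    ns = min(n_states - 1, max(0, s + [-5, 5, -1, 1][a]))
    reward = 1.0 if ns == n_states - 1 else -0.01
    return ns, reward, ns == n_states - 1

def safe_actions(s):
    """Filter the action space by the (here, already linear) constraint c(s, a) <= d."""
    feasible = [a for a in range(n_actions) if safety_cost(s, a) <= d]
    return feasible if feasible else list(range(n_actions))   # fall back if nothing is safe

def select_action(s):
    """Epsilon-greedy w.r.t. the Lagrangian Q(s, a) - lam * c(s, a) over safe actions."""
    acts = safe_actions(s)
    if rng.random() < eps:
        return int(rng.choice(acts))
    scores = [Q[s, a] - lam * safety_cost(s, a) for a in acts]
    return acts[int(np.argmax(scores))]

for episode in range(500):
    s = 0
    for t in range(200):                              # step cap keeps episodes bounded
        a = select_action(s)
        ns, r, done = env_step(s, a)
        # Standard Q-learning target, but the max is taken over safe actions only.
        best_next = 0.0 if done else max(Q[ns, b] for b in safe_actions(ns))
        Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])
        # Dual ascent on the multiplier: it grows while the constraint is violated.
        lam = max(0.0, lam + lam_lr * (safety_cost(s, a) - d))
        if done:
            break
        s = ns
```

In this sketch the constraint filtering guarantees that no explicitly unsafe action is ever executed, while the multiplier additionally penalizes actions with nonzero cost when ranking them; the paper's own solution linearizes general constraint functions before applying the multiplier method, which the already-linear indicator cost here does not need.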
