Reduced variance deep reinforcement learning with temporal logic specifications

In this paper, we propose a model-free reinforcement learning method to synthesize control policies that satisfy Linear Temporal Logic (LTL) specifications for mobile robots modeled as Markov Decision Processes (MDPs) with unknown transition probabilities. Specifically, we develop a reduced variance deep Q-learning technique that relies on Neural Networks (NNs) to approximate the state-action values of the MDP and employs a reward function that depends on the accepting condition of the Deterministic Rabin Automaton (DRA) that captures the LTL specification. The key idea is to convert the deep Q-learning problem into a nonconvex max-min optimization problem with a finite-sum structure, and to develop an Arrow-Hurwicz-Uzawa-type stochastic reduced variance algorithm with constant stepsize to solve it. Unlike the Stochastic Gradient Descent (SGD) methods commonly used in deep reinforcement learning, our method estimates the gradients of an unknown loss function more accurately and improves the stability of the training process. Moreover, our method does not require learning the transition probabilities of the MDP, constructing a product MDP, or computing Accepting Maximal End Components (AMECs). This allows the robot to learn an optimal policy even if the environment cannot be modeled accurately or if AMECs do not exist. In the latter case, the resulting control policies minimize the frequency with which the system enters bad states in the DRA, i.e., states that violate the task specification. To the best of our knowledge, this is the first model-free deep reinforcement learning algorithm that can synthesize policies maximizing the probability of satisfying an LTL specification even when AMECs do not exist. We provide a rigorous convergence analysis and convergence rate for the proposed algorithm, as well as numerical experiments that validate our method.
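To convey the finite-sum, variance-reduced flavor of the update described above, the following is a minimal sketch in which an SVRG-style estimator (variance reduction via a periodically recomputed full gradient over a replay buffer) is applied to the squared Bellman error, with a reward keyed to the DRA accepting condition. A linear Q-function stands in for the paper's neural network, and the full primal-dual (Arrow-Hurwicz-Uzawa) max-min structure is omitted; the names dra_reward, sample_grad, and svrg_style_epoch are illustrative and not taken from the paper.

```python
import numpy as np

# Illustrative reward keyed to the DRA accepting condition: positive when an
# accepting ("good") DRA state is visited, negative for a rejecting ("bad") one.
def dra_reward(dra_state, accepting_states, rejecting_states):
    if dra_state in accepting_states:
        return 1.0
    if dra_state in rejecting_states:
        return -1.0
    return 0.0

# Semi-gradient of 0.5 * (r + gamma * max_a' Q(s', a') - Q(s, a))^2 for one
# transition, with a linear Q-function Q(s, a) = w @ phi(s, a) standing in for
# the paper's neural network; the bootstrapped target is treated as fixed.
def sample_grad(w, phi_sa, phi_next_best, r, gamma=0.99):
    td_error = r + gamma * (phi_next_best @ w) - (phi_sa @ w)
    return -td_error * phi_sa

def svrg_style_epoch(w, buffer, step=0.05, inner_steps=100, seed=0):
    """One outer epoch of an SVRG-style variance-reduced update.

    buffer: list of tuples (phi_sa, phi_next_best, r) forming the finite sum.
    """
    rng = np.random.default_rng(seed)
    w_snap = w.copy()
    # Full gradient at the snapshot point (the anchor of the finite sum).
    full_grad = np.mean([sample_grad(w_snap, *t) for t in buffer], axis=0)
    for _ in range(inner_steps):
        i = rng.integers(len(buffer))
        # Variance-reduced estimate: g_i(w) - g_i(w_snap) + full_grad.
        g = sample_grad(w, *buffer[i]) - sample_grad(w_snap, *buffer[i]) + full_grad
        w = w - step * g  # constant step size, as in the algorithm described above
    return w

# Toy usage with random features and hypothetical DRA states 0..3,
# where state 2 is accepting and state 3 is rejecting.
rng = np.random.default_rng(1)
buffer = [(rng.standard_normal(4), rng.standard_normal(4), dra_reward(q, {2}, {3}))
          for q in rng.integers(0, 4, size=64)]
w = svrg_style_epoch(np.zeros(4), buffer)
```

The sketch only illustrates how a variance-reduced gradient estimate combines per-sample gradients with a snapshot full gradient; the paper's algorithm instead solves the resulting nonconvex max-min finite-sum problem with a primal-dual update.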
