Long-term Planning by Short-term Prediction

We consider planning problems, which often arise in autonomous driving applications, in which an agent must decide on immediate actions so as to optimize a long-term objective. For example, when a car tries to merge into a roundabout, it must decide on an immediate acceleration/braking command, while the long-term effect of that command is the success or failure of the merge. Such problems are characterized by continuous state and action spaces and by interaction with multiple agents whose behavior can be adversarial. We argue that dual versions of the MDP framework (those that rely on the value function and the $Q$ function) are problematic for autonomous driving applications, due to the non-Markovian nature of the natural state space representation and due to the continuous state and action spaces. We propose to tackle the planning task by decomposing the problem into two phases. First, we apply supervised learning to predict the near future from the present, requiring the predictor to be differentiable with respect to the representation of the present. Second, we model a full trajectory of the agent using a recurrent neural network, where unexplained factors are modeled as (additive) input nodes. This allows us to solve the long-term planning problem with supervised learning techniques and direct optimization over the recurrent neural network. Our approach enables us to learn robust policies by incorporating adversarial elements into the environment.
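To make the two-phase idea concrete, below is a minimal sketch (not the authors' implementation) of unrolling a learned, differentiable short-term predictor as a recurrent computation, injecting unexplained factors as additive inputs, and optimizing policy parameters by back-propagation through the unrolled trajectory. All names (`predict_next`, `policy`, `HORIZON`, the linear placeholder dynamics, and the stage cost) are illustrative assumptions.

```python
# Sketch: direct optimization of a policy through an unrolled differentiable predictor.
import jax
import jax.numpy as jnp

STATE_DIM, ACTION_DIM, HORIZON = 4, 1, 20


def predict_next(state, action, nu):
    """Differentiable short-term predictor: s_{t+1} = f(s_t, a_t) + nu_t.

    Here f is a fixed linear placeholder; in practice it would be trained with
    supervised learning on observed transitions. The additive term nu models
    the unexplained (possibly adversarial) factors.
    """
    A = jnp.eye(STATE_DIM)
    B = jnp.ones((STATE_DIM, ACTION_DIM)) * 0.1
    return A @ state + B @ action + nu


def policy(theta, state):
    """Differentiable policy a_t = pi_theta(s_t); linear for illustration."""
    return jnp.tanh(theta @ state)


def trajectory_cost(theta, s0, nus):
    """Unroll the predictor for HORIZON steps and accumulate a long-term cost."""
    def step(state, nu):
        action = policy(theta, state)
        next_state = predict_next(state, action, nu)
        # Illustrative stage cost: distance to the origin plus control effort.
        cost = jnp.sum(next_state ** 2) + 0.01 * jnp.sum(action ** 2)
        return next_state, cost

    _, costs = jax.lax.scan(step, s0, nus)
    return jnp.sum(costs)


# Gradient descent on the policy parameters through the unrolled recurrence.
key = jax.random.PRNGKey(0)
theta = jnp.zeros((ACTION_DIM, STATE_DIM))
s0 = jnp.array([1.0, 0.0, -1.0, 0.5])
nus = 0.05 * jax.random.normal(key, (HORIZON, STATE_DIM))  # unexplained inputs

grad_fn = jax.jit(jax.grad(trajectory_cost))
for _ in range(200):
    theta = theta - 0.05 * grad_fn(theta, s0, nus)
```

In this sketch the robustness aspect would enter by choosing the `nus` sequence adversarially (e.g., maximizing the same cost) rather than sampling it at random.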
