Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs

Author(s): Schulman, John | Advisor(s): Abbeel, Pieter

Abstract: This thesis is mostly focused on reinforcement learning, which is viewed as an optimization problem: maximize the expected total reward with respect to the parameters of the policy.

The first part of the thesis is concerned with making policy gradient methods more sample-efficient and reliable, especially when used with expressive nonlinear function approximators such as neural networks. Chapter 3 considers how to ensure that policy updates lead to monotonic improvement, and how to optimally update a policy given a batch of sampled trajectories. After providing a theoretical analysis, we propose a practical method called trust region policy optimization (TRPO), which performs well on two challenging tasks: simulated robotic locomotion, and playing Atari games using screen images as input. Chapter 4 looks at improving the sample complexity of policy gradient methods in a way that is complementary to TRPO: reducing the variance of policy gradient estimates using a state-value function. Using this method, we obtain state-of-the-art results for learning locomotion controllers for simulated 3D robots.

Reinforcement learning can be viewed as a special case of optimizing an expectation, and similar optimization problems arise in other areas of machine learning, for example in variational inference and when using architectures that include mechanisms for memory and attention. Chapter 5 provides a unifying view of these problems, with a general calculus for obtaining gradient estimators of objectives that involve a mixture of sampled random variables and differentiable operations. This unifying view motivates applying algorithms from reinforcement learning to other prediction and probabilistic modeling problems.
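
As a compact reference point for Chapters 3 and 4, the two central objects can be written in standard notation: the likelihood-ratio policy gradient with a learned state-value baseline, and TRPO's KL-constrained surrogate problem. Here \hat{A}_t denotes an estimated advantage, V_\phi a learned state-value function, and \delta the trust-region size.

```latex
% Likelihood-ratio policy gradient with a state-value baseline (Chapter 4):
\nabla_\theta \, \mathbb{E}\!\left[\sum_t r_t\right]
  \approx \mathbb{E}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right],
\qquad
\hat{A}_t = \sum_{t' \ge t} \gamma^{\,t'-t}\, r_{t'} \;-\; V_\phi(s_t).

% Trust region policy optimization (Chapter 3): maximize a surrogate objective
% over a batch of trajectories from the old policy, subject to a KL trust region.
\max_\theta \;\; \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}
  \!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, \hat{A}(s,a)\right]
\quad \text{s.t.} \quad
\mathbb{E}_{s}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid s)\big)\right] \le \delta .
```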

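The "general calculus" of Chapter 5 reduces, in its simplest case, to the score-function (likelihood-ratio) estimator: weight the gradient of each sampled variable's log-probability by the downstream cost, and subtract a baseline for variance reduction. Below is a minimal NumPy sketch of that estimator on a toy one-variable objective; the Bernoulli distribution, the cost function, and the pilot-batch baseline are illustrative choices, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(x):
    # Illustrative per-sample cost c(x); any function of the sample works.
    return (x - 0.7) ** 2

theta = 0.3
p = sigmoid(theta)

# Objective: L(theta) = E_{x ~ Bernoulli(p)}[c(x)] with p = sigmoid(theta).
# Exact gradient, used here only as a ground-truth check:
#   dL/dtheta = (c(1) - c(0)) * p * (1 - p)
exact_grad = (cost(1.0) - cost(0.0)) * p * (1.0 - p)

# Score-function (likelihood-ratio) estimator:
#   dL/dtheta = E[ (c(x) - b) * d/dtheta log p_theta(x) ]
# A constant baseline b leaves the estimator unbiased and reduces variance;
# here b is fit on a separate pilot batch (a state-value function plays this
# role in the reinforcement learning setting).
pilot = (rng.random(10_000) < p).astype(float)
baseline = cost(pilot).mean()

n = 200_000
x = (rng.random(n) < p).astype(float)
dlogp = np.where(x == 1.0, 1.0 - p, -p)   # d/dtheta log p_theta(x)
sf_grad = ((cost(x) - baseline) * dlogp).mean()

print(f"exact gradient:       {exact_grad:+.5f}")
print(f"score-function est.:  {sf_grad:+.5f}")
```

The Monte Carlo estimate converges to the exact gradient as the sample size grows; replacing the scalar baseline with a learned state-value function is exactly the variance-reduction step the abstract describes for Chapter 4.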