Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs

Author(s): Schulman, John | Advisor(s): Abbeel, Pieter

Abstract: This thesis is mostly focused on reinforcement learning, which is viewed as an optimization problem: maximize the expected total reward with respect to the parameters of the policy.

The first part of the thesis is concerned with making policy gradient methods more sample-efficient and reliable, especially when used with expressive nonlinear function approximators such as neural networks. Chapter 3 considers how to ensure that policy updates lead to monotonic improvement, and how to optimally update a policy given a batch of sampled trajectories. After providing a theoretical analysis, we propose a practical method called trust region policy optimization (TRPO), which performs well on two challenging tasks: simulated robotic locomotion, and playing Atari games using screen images as input. Chapter 4 looks at improving the sample complexity of policy gradient methods in a way that is complementary to TRPO: reducing the variance of policy gradient estimates using a state-value function. Using this method, we obtain state-of-the-art results for learning locomotion controllers for simulated 3D robots.

Reinforcement learning can be viewed as a special case of optimizing an expectation, and similar optimization problems arise in other areas of machine learning, for example in variational inference and when using architectures that include mechanisms for memory and attention. Chapter 5 provides a unifying view of these problems, with a general calculus for obtaining gradient estimators of objectives that involve a mixture of sampled random variables and differentiable operations. This unifying view motivates applying algorithms from reinforcement learning to other prediction and probabilistic modeling problems.
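
As a compact reference point for Chapters 3 and 4, the two central objects can be written in standard notation: the likelihood-ratio policy gradient with a learned state-value baseline, and TRPO's KL-constrained surrogate problem. Here \hat{A}_t denotes an estimated advantage, V_\phi a learned state-value function, and \delta the trust-region size.

```latex
% Likelihood-ratio policy gradient with a state-value baseline (Chapter 4):
\nabla_\theta \, \mathbb{E}\!\left[\sum_t r_t\right]
  \approx \mathbb{E}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right],
\qquad
\hat{A}_t = \sum_{t' \ge t} \gamma^{\,t'-t}\, r_{t'} \;-\; V_\phi(s_t).

% Trust region policy optimization (Chapter 3): maximize a surrogate objective
% over a batch of trajectories from the old policy, subject to a KL trust region.
\max_\theta \;\; \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}
  \!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, \hat{A}(s,a)\right]
\quad \text{s.t.} \quad
\mathbb{E}_{s}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid s)\big)\right] \le \delta .
```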

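The "general calculus" of Chapter 5 reduces, in its simplest case, to the score-function (likelihood-ratio) estimator: weight the gradient of each sampled variable's log-probability by the downstream cost, and subtract a baseline for variance reduction. Below is a minimal NumPy sketch of that estimator on a toy one-variable objective; the Bernoulli distribution, the cost function, and the pilot-batch baseline are illustrative choices, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(x):
    # Illustrative per-sample cost c(x); any function of the sample works.
    return (x - 0.7) ** 2

theta = 0.3
p = sigmoid(theta)

# Objective: L(theta) = E_{x ~ Bernoulli(p)}[c(x)] with p = sigmoid(theta).
# Exact gradient, used here only as a ground-truth check:
#   dL/dtheta = (c(1) - c(0)) * p * (1 - p)
exact_grad = (cost(1.0) - cost(0.0)) * p * (1.0 - p)

# Score-function (likelihood-ratio) estimator:
#   dL/dtheta = E[ (c(x) - b) * d/dtheta log p_theta(x) ]
# A constant baseline b leaves the estimator unbiased and reduces variance;
# here b is fit on a separate pilot batch (a state-value function plays this
# role in the reinforcement learning setting).
pilot = (rng.random(10_000) < p).astype(float)
baseline = cost(pilot).mean()

n = 200_000
x = (rng.random(n) < p).astype(float)
dlogp = np.where(x == 1.0, 1.0 - p, -p)   # d/dtheta log p_theta(x)
sf_grad = ((cost(x) - baseline) * dlogp).mean()

print(f"exact gradient:       {exact_grad:+.5f}")
print(f"score-function est.:  {sf_grad:+.5f}")
```

The Monte Carlo estimate converges to the exact gradient as the sample size grows; replacing the scalar baseline with a learned state-value function is exactly the variance-reduction step the abstract describes for Chapter 4.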