Benchmarking Deep Reinforcement Learning for Continuous Control

Recently, researchers have made significant progress by combining advances in deep learning for feature representation with reinforcement learning. Notable examples include training agents to play Atari games from raw pixel data and to acquire advanced manipulation skills from raw sensory inputs. However, progress in the domain of continuous control has been difficult to quantify due to the lack of a commonly adopted benchmark. In this work, we present a benchmark suite of continuous control tasks, including classic tasks such as cart-pole swing-up, tasks with very high state and action dimensionality such as 3D humanoid locomotion, tasks with partial observations, and tasks with hierarchical structure. We report novel findings based on a systematic evaluation of a range of implemented reinforcement learning algorithms. Both the benchmark and the reference implementations are released at this https URL to facilitate experimental reproducibility and to encourage adoption by other researchers.
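As a toy illustration of what such a continuous control task involves — a continuous state, a bounded continuous action, and a scalar reward — below is a minimal, self-contained cart-pole swing-up sketch evaluated with a random policy. This is not the released benchmark code: the dynamics follow the standard textbook cart-pole formulation, and the reward shaping and parameter values are illustrative assumptions.

```python
import math
import random

# Toy cart-pole swing-up: the pole starts hanging down and a continuous
# horizontal force on the cart must swing it upright. Constants and the
# reward are common textbook choices, not the benchmark's specification.
GRAVITY = 9.8        # m/s^2
CART_MASS = 1.0      # kg
POLE_MASS = 0.1      # kg
POLE_HALF_LEN = 0.5  # m
DT = 0.02            # integration step, s
MAX_FORCE = 10.0     # action bound, N


def step(state, force):
    """Advance the cart-pole one Euler step under a horizontal force."""
    x, x_dot, theta, theta_dot = state
    force = max(-MAX_FORCE, min(MAX_FORCE, force))  # clip continuous action
    total_mass = CART_MASS + POLE_MASS
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    # Standard cart-pole equations of motion (theta = 0 means upright).
    temp = (force + POLE_MASS * POLE_HALF_LEN * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        POLE_HALF_LEN * (4.0 / 3.0 - POLE_MASS * cos_t ** 2 / total_mass)
    )
    x_acc = temp - POLE_MASS * POLE_HALF_LEN * theta_acc * cos_t / total_mass
    # Semi-implicit Euler integration.
    x_dot += DT * x_acc
    x += DT * x_dot
    theta_dot += DT * theta_acc
    theta += DT * theta_dot
    # Shaped reward: +1 with the pole upright, -1 with it hanging down.
    reward = math.cos(theta)
    return (x, x_dot, theta, theta_dot), reward


def rollout(policy, horizon=500):
    """Run one episode from the hanging-down state; return total reward."""
    state = (0.0, 0.0, math.pi, 0.0)  # pole pointing straight down
    total = 0.0
    for _ in range(horizon):
        state, reward = step(state, policy(state))
        total += reward
    return total


if __name__ == "__main__":
    random_policy = lambda s: random.uniform(-MAX_FORCE, MAX_FORCE)
    returns = [rollout(random_policy) for _ in range(10)]
    print("mean return of a random policy:", sum(returns) / len(returns))
```

In the benchmark itself the dynamics come from a physics engine and the policies are trained rather than random, but the interaction loop is the same: observe a continuous state, emit a bounded continuous action, receive a scalar reward.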
