EPOpt: Learning Robust Neural Network Policies Using Model Ensembles

Sample complexity and safety are major challenges when learning policies with reinforcement learning for real-world tasks, especially when the policies are represented using rich function approximators such as deep neural networks. Model-based methods, in which the real-world target domain is approximated by a simulated source domain, offer a way to tackle these challenges by augmenting real data with simulated data. However, discrepancies between the simulated source domain and the target domain remain an obstacle for simulated training. We introduce the EPOpt algorithm, which uses an ensemble of simulated source domains together with a form of adversarial training to learn policies that are robust and generalize to a broad range of possible target domains, including unmodeled effects. Further, the probability distribution over source domains in the ensemble can be adapted using data from the target domain and approximate Bayesian methods, making it a progressively better approximation of the target domain. Thus, learning on a model ensemble, combined with source-domain adaptation, provides the benefits of both robustness and adaptation.
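To make the two ingredients of the abstract concrete, the following Python is a minimal sketch, not the paper's implementation: the helper names (sample_params, rollout, policy_update, transition_likelihood) are hypothetical placeholders standing in for a simulator interface, a batch policy-gradient method (the paper uses a TRPO-style update; any batch policy-gradient step fits here), and a transition-likelihood model used for the approximate Bayesian adaptation.

```python
import numpy as np


def epopt_epsilon_step(policy, sample_params, rollout, policy_update,
                       n_models=100, epsilon=0.1):
    """One adversarial training step over an ensemble of simulated models.

    All callables are illustrative assumptions, not the paper's API:
      sample_params() -> one simulator parameter vector drawn from the
                         current source-domain distribution
      rollout(policy, params) -> (trajectory, total_return) in simulation
      policy_update(policy, trajectories) -> policy after one batch
                         policy-gradient update on the given trajectories
    """
    # 1. Sample an ensemble of source-domain models and roll out the policy.
    results = [rollout(policy, sample_params()) for _ in range(n_models)]
    returns = np.array([ret for (_, ret) in results])

    # 2. Adversarial (percentile/CVaR-style) selection: keep only the
    #    epsilon-fraction of trajectories with the lowest returns.
    cutoff = np.percentile(returns, 100 * epsilon)
    worst_trajectories = [traj for (traj, ret) in results if ret <= cutoff]

    # 3. Update the policy on the worst-case trajectories only, pushing it
    #    to improve on the hardest models in the ensemble.
    return policy_update(policy, worst_trajectories)


def reweight_source_distribution(sampled_params, prior_weights,
                                 transition_likelihood, target_data):
    """Sample-based approximate Bayesian update of the source distribution.

    transition_likelihood(params, target_data) is a hypothetical helper
    returning p(target_data | params); reweighting the sampled parameter
    vectors by this likelihood gives a simple importance-sampling
    approximation to the posterior over source-domain parameters.
    """
    likelihoods = np.array([transition_likelihood(p, target_data)
                            for p in sampled_params])
    posterior = np.asarray(prior_weights, dtype=float) * likelihoods
    return posterior / posterior.sum()
```

In use, the two pieces alternate: train the policy with epopt_epsilon_step under the current source-domain distribution, collect a small amount of target-domain data, reweight the distribution with reweight_source_distribution, and repeat, which is the robustness-plus-adaptation loop the abstract describes.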
