Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers

We propose a simple, practical, and intuitive approach for domain adaptation in reinforcement learning. Our approach stems from the idea that the agent's experience in the source domain should look similar to its experience in the target domain. Building on a probabilistic view of RL, we formally show that we can achieve this goal by compensating for the difference in dynamics with a modified reward function. This modified reward function is simple to estimate by learning auxiliary classifiers that distinguish source-domain transitions from target-domain transitions. Intuitively, the modified reward function penalizes the agent for visiting states and taking actions in the source domain that are not possible in the target domain. Said another way, the agent is penalized for transitions that would indicate it is interacting with the source domain rather than the target domain. Our approach is applicable to domains with continuous states and actions and does not require learning an explicit model of the dynamics. On discrete and continuous control tasks, we illustrate the mechanics of our approach and demonstrate its scalability to high-dimensional tasks.
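
As a concrete illustration, the sketch below shows one way the classifier-based reward correction described above might be computed. It is not taken from the paper: the use of PyTorch, the module and function names (mlp, DomainClassifiers, reward_correction), and the network sizes are all assumptions. The sketch only assumes two binary domain classifiers, one over full transitions (s, a, s') and one over state-action pairs (s, a), trained on balanced batches of source and target data so that each classifier's logit equals the corresponding log-probability ratio between the two domains.

```python
import torch
import torch.nn as nn


def mlp(in_dim, hidden=256):
    # Small MLP emitting a single logit. For a balanced binary source-vs-target
    # classifier, this logit approximates log p(target | x) - log p(source | x).
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )


class DomainClassifiers(nn.Module):
    """Hypothetical pair of domain classifiers: one conditioned on (s, a, s'),
    one conditioned on (s, a) only."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.sas = mlp(2 * obs_dim + act_dim, hidden)
        self.sa = mlp(obs_dim + act_dim, hidden)

    def reward_correction(self, s, a, s_next):
        # Delta r(s, a, s') = [log p(target | s, a, s') - log p(source | s, a, s')]
        #                   - [log p(target | s, a)     - log p(source | s, a)];
        # under the balanced-classifier assumption, each bracket is exactly the
        # corresponding classifier's logit.
        logit_sas = self.sas(torch.cat([s, a, s_next], dim=-1)).squeeze(-1)
        logit_sa = self.sa(torch.cat([s, a], dim=-1)).squeeze(-1)
        return (logit_sas - logit_sa).detach()
```

In this sketch, both classifiers would be trained with binary cross-entropy on balanced batches of source and target transitions, and the resulting correction would be added to the source-domain reward before running a standard (e.g., maximum-entropy) RL algorithm; the exact training and integration details are assumptions rather than the paper's prescribed procedure.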
