Modular transfer learning with transition mismatch compensation for excessive disturbance rejection

Underwater robots operating in shallow water usually suffer from strong wave forces, which may frequently exceed the robot's control constraints. Learning-based controllers are well suited to disturbance rejection control, but excessive disturbances heavily alter the state transitions of the underlying Markov Decision Process (MDP) or Partially Observable Markov Decision Process (POMDP). Moreover, learning purely on the target system risks damaging exploratory actions or unpredictable system variations, while training exclusively on a prior model usually cannot address the model mismatch with the target system. In this paper, we propose a transfer learning framework that adapts a control policy for excessive disturbance rejection of an underwater robot under dynamics model mismatch. A modular network of learning policies is applied, composed of a Generalized Control Policy (GCP) and an Online Disturbance Identification Model (ODI). The GCP is first trained over a wide array of disturbance waveforms. The ODI then learns to use past states and actions of the system to predict the disturbance waveform, which is provided as input to the GCP along with the system state. On top of this modular architecture, we develop a transfer reinforcement learning algorithm using Transition Mismatch Compensation (TMC), which learns an additional compensatory policy by minimizing the mismatch between transitions predicted by the two dynamics models of the source and target tasks. We demonstrate on a pose regulation task in simulation that TMC successfully rejects the disturbances and stabilizes the robot under an empirical model of the robot system, while also improving sample efficiency.
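To make the modular architecture concrete, below is a minimal sketch, assuming PyTorch, of how the ODI and GCP modules and a transition-mismatch loss of the kind TMC minimizes could be wired together. All dimensions (STATE_DIM, ACT_DIM, DIST_DIM), layer choices, and the exact loss form are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch, assuming PyTorch, of the modular GCP + ODI policy and a
# plausible transition-mismatch loss for TMC. Dimensions, module layouts, and
# names are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn

STATE_DIM, ACT_DIM, DIST_DIM, HIDDEN = 12, 4, 8, 64  # hypothetical sizes


class ODI(nn.Module):
    """Online Disturbance Identification: encodes a window of past
    (state, action) pairs into a predicted disturbance-waveform feature."""

    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(STATE_DIM + ACT_DIM, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, DIST_DIM)

    def forward(self, history):  # history: (batch, time, STATE_DIM + ACT_DIM)
        out, _ = self.rnn(history)
        return self.head(out[:, -1])  # disturbance estimate at the last step


class GCP(nn.Module):
    """Generalized Control Policy: acts on the current state together with
    the disturbance feature supplied by ODI."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + DIST_DIM, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, ACT_DIM), nn.Tanh(),  # bounded thruster commands
        )

    def forward(self, state, disturbance):
        return self.net(torch.cat([state, disturbance], dim=-1))


def tmc_loss(f_source, f_target, comp_policy, state, base_action):
    """One plausible reading of the TMC objective: the compensatory policy
    corrects the base action so that the target-model transition matches the
    transition the source model predicts for the uncompensated action."""
    compensated = base_action + comp_policy(state)
    next_src = f_source(state, base_action)  # transition the GCP was trained on
    next_tgt = f_target(state, compensated)  # transition under the target model
    return ((next_src - next_tgt) ** 2).mean()


# Usage sketch: predict the disturbance from history, then act.
odi, gcp = ODI(), GCP()
history = torch.zeros(1, 10, STATE_DIM + ACT_DIM)
state = torch.zeros(1, STATE_DIM)
action = gcp(state, odi(history))
```

Here `f_source` and `f_target` stand in for the two learned dynamics models of the source and target tasks; the additive correction is one simple way a compensatory policy could act on top of the frozen GCP output.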
