Bayesian Controller Fusion: Leveraging Control Priors in Deep Reinforcement Learning for Robotics

We present Bayesian Controller Fusion (BCF): a hybrid control strategy that combines the strengths of traditional hand-crafted controllers and model-free deep reinforcement learning (RL). BCF thrives in the robotics domain, where reliable but suboptimal control priors exist for many tasks, but RL from scratch remains unsafe and data-inefficient. By fusing uncertainty-aware distributional outputs from each system, BCF arbitrates control between them, exploiting their respective strengths. We study BCF on two real-world robotics tasks: navigation in a vast, long-horizon environment, and a complex reaching task that involves manipulability maximisation. For both domains, simple hand-crafted controllers exist that can solve the task at hand in a risk-averse manner but do not necessarily provide the optimal solution, given limitations in analytical modelling, controller miscalibration, and task variation. As exploration is naturally guided by the prior in the early stages of training, BCF accelerates learning while substantially improving beyond the performance of the control prior as the policy gains more experience. More importantly, given the risk aversion of the control prior, BCF ensures safe exploration and deployment, with the control prior naturally dominating the action distribution in states unknown to the policy. We additionally show BCF's applicability to the zero-shot sim-to-real setting and its ability to deal with out-of-distribution states in the real world. BCF is a promising approach for combining the complementary strengths of deep RL and traditional robotic control, surpassing what either can achieve independently. The code and supplementary video material are made publicly available at https://krishanrana.github.io/bcf.
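To make the idea of fusing uncertainty-aware distributional outputs concrete, the sketch below illustrates one common way such a fusion can be realised: precision-weighted combination (a normalised product) of two independent Gaussian action distributions, one from the learned policy and one from the control prior. This is an illustrative assumption rather than the paper's exact formulation; the function name, inputs, and example values are hypothetical.

import numpy as np

def fuse_gaussians(mu_pi, sigma_pi, mu_prior, sigma_prior):
    """Precision-weighted fusion of two independent Gaussians, per action
    dimension (a minimal sketch; consult the paper for BCF's exact update).

    The component with the smaller variance (higher confidence) dominates
    the fused mean, so a confident prior takes over when the policy is
    uncertain, and vice versa.
    """
    var_pi, var_prior = sigma_pi ** 2, sigma_prior ** 2
    var_fused = (var_pi * var_prior) / (var_pi + var_prior)
    mu_fused = (mu_pi * var_prior + mu_prior * var_pi) / (var_pi + var_prior)
    return mu_fused, np.sqrt(var_fused)

# Hypothetical 2-D action example: early in training the policy is
# uncertain (large sigma), so the fused action stays close to the prior.
mu, sigma = fuse_gaussians(
    mu_pi=np.array([0.9, -0.4]),
    sigma_pi=np.array([1.0, 1.0]),
    mu_prior=np.array([0.1, 0.2]),
    sigma_prior=np.array([0.2, 0.2]),
)
print(mu, sigma)

Under this assumed formulation, the fused variance is always smaller than either input variance, and the fused mean interpolates between policy and prior according to their relative confidence, which matches the behaviour described in the abstract (prior-guided exploration early on, policy dominance once it becomes confident).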
