Bayesian Residual Policy Optimization: Scalable Bayesian Reinforcement Learning with Clairvoyant Experts

Informed and robust decision making in the face of uncertainty is critical for robots that perform physical tasks alongside people. We formulate this as Bayesian Reinforcement Learning over latent Markov Decision Processes (MDPs). While Bayes-optimality is theoretically the gold standard, existing algorithms do not scale well to continuous state and action spaces. Our proposal builds on the following insight: in the absence of uncertainty, each latent MDP is easier to solve. We first obtain an ensemble of experts, one for each latent MDP, and fuse their advice to compute a baseline policy. Next, we train a Bayesian residual policy to improve upon the ensemble's recommendation and to learn to reduce uncertainty. Our algorithm, Bayesian Residual Policy Optimization (BRPO), combines the scalability of policy gradient methods with task-specific expert skills. BRPO significantly improves on the ensemble of experts and drastically outperforms existing adaptive RL methods.
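To make the pipeline concrete, the sketch below illustrates (under stated assumptions, not as the paper's implementation) how a BRPO-style agent could form an action: per-latent-MDP expert recommendations are fused under the current belief into a baseline action, and a learned residual policy adds a correction on top. The belief-weighted fusion rule, the function names, and the stand-in residual network are illustrative assumptions; in BRPO the residual policy would be trained with a policy-gradient method such as PPO, which is omitted here.

```python
import numpy as np

def fuse_expert_advice(belief, expert_actions):
    """Baseline recommendation from the expert ensemble.

    The paper fuses the experts' advice into a baseline policy; the exact
    fusion rule used here (a posterior-weighted mean of per-MDP expert
    actions) is an illustrative assumption.
    """
    return np.average(expert_actions, axis=0, weights=belief)

def brpo_action(belief, expert_actions, residual_policy):
    """Final action = ensemble baseline + learned residual correction.

    `residual_policy` maps features (here, the belief concatenated with the
    baseline action) to a corrective action. In BRPO this residual would be
    a neural network trained with policy gradients; training is omitted.
    """
    baseline = fuse_expert_advice(belief, expert_actions)
    correction = residual_policy(np.concatenate([belief, baseline]))
    return baseline + correction

# Toy usage: 3 latent MDPs, a 2-D continuous action space.
belief = np.array([0.5, 0.3, 0.2])                 # posterior over latent MDPs
expert_actions = np.array([[1.0, 0.0],             # clairvoyant expert per MDP
                           [0.0, 1.0],
                           [0.5, 0.5]])
residual_policy = lambda x: 0.1 * np.tanh(x[-2:])  # stand-in for a trained network
print(brpo_action(belief, expert_actions, residual_policy))
```

The design point this sketch highlights is that the residual policy only needs to learn a correction around an already-reasonable baseline, which is what lets BRPO import the experts' task-specific skill while keeping the scalability of policy gradient training.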
