Variable risk control via stochastic optimization

We present new global and local policy search algorithms suitable for problems with policy-dependent cost variance (or risk), a property present in many robot control tasks. These algorithms exploit new techniques in non-parametric heteroscedastic regression to directly model the policy-dependent distribution of cost. For local search, the learned cost model can be used as a critic for performing risk-sensitive gradient descent. Alternatively, decision-theoretic criteria can be applied to globally select policies that balance exploration and exploitation in a principled way, or to perform greedy minimization with respect to various risk-sensitive criteria. This separation of learning and policy selection permits variable risk control: risk sensitivity can be adjusted flexibly, and appropriate policies can be selected at runtime without relearning. We describe experiments in dynamic stabilization and manipulation with a mobile manipulator that demonstrate the learning of flexible, risk-sensitive policies in very few trials.
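
The variable-risk idea in the abstract can be made concrete: fit a heteroscedastic model of cost as a function of policy parameters once, then greedily select policies under different risk-sensitive criteria at runtime, with no relearning. Below is a minimal, illustrative sketch of that separation. It is not the authors' implementation: Nadaraya-Watson kernel smoothing stands in for the paper's non-parametric heteroscedastic regression so the example runs with numpy alone, a simple mean-plus-scaled-deviation criterion stands in for the paper's risk-sensitive criteria, and all function names, the bandwidth, and the `kappa` parameter are assumptions made for illustration.

```python
# Minimal sketch of variable-risk policy selection (illustrative assumptions
# throughout; the paper uses non-parametric heteroscedastic regression, for
# which a Nadaraya-Watson kernel smoother stands in here).
import numpy as np

def kernel_weights(theta_query, thetas, bandwidth=0.2):
    # Gaussian kernel weights between a query policy and all observed policies.
    sq_dists = np.sum((thetas - theta_query) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / bandwidth ** 2)

def cost_mean_and_std(theta_query, thetas, costs, bandwidth=0.2):
    # Heteroscedastic estimate: locally weighted mean and standard deviation
    # of observed cost, so predicted noise depends on the policy parameters.
    w = kernel_weights(theta_query, thetas, bandwidth)
    w = w / (w.sum() + 1e-12)
    mu = np.dot(w, costs)
    var = np.dot(w, (costs - mu) ** 2)  # policy-dependent cost variance
    return mu, np.sqrt(var)

def select_policy(candidates, thetas, costs, kappa):
    # Greedy risk-sensitive selection: minimize mu + kappa * sigma.
    # kappa > 0 is risk-averse, kappa < 0 risk-seeking, kappa = 0 risk-neutral.
    scores = []
    for theta in candidates:
        mu, sigma = cost_mean_and_std(theta, thetas, costs)
        scores.append(mu + kappa * sigma)
    return candidates[int(np.argmin(scores))]

# Toy usage: a 1-D policy parameter whose cost noise grows with theta.
rng = np.random.default_rng(0)
thetas = rng.uniform(0.0, 1.0, size=(200, 1))
costs = (thetas[:, 0] - 0.4) ** 2 + rng.normal(0.0, 0.05 + 0.5 * thetas[:, 0])
candidates = np.linspace(0.0, 1.0, 101)[:, None]

# One learned model serves every runtime risk level -- no relearning needed.
for kappa in (-1.0, 0.0, 1.0):
    theta_star = select_policy(candidates, thetas, costs, kappa)
    print(f"kappa={kappa:+.1f} -> theta={theta_star[0]:.2f}")
```

Because the model of the cost distribution is learned independently of the selection criterion, changing `kappa` only re-scores the candidates; in this toy example the risk-averse setting shifts the chosen policy toward the low-noise region, mirroring the runtime risk adjustment the abstract describes.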
