Reducing hardware experiments for model learning and policy optimization

Conducting hardware experiments is often expensive in several respects, including potential damage to the robot and the number of people required to operate it safely. Computer simulation is used in place of hardware in such cases, but it suffers from so-called simulation bias, in which policies tuned in simulation fail on hardware because of differences between the two systems. Model-free methods such as Q-learning, on the other hand, do not require a model and can therefore avoid this issue. However, they typically require a large number of experiments, which may be unrealistic for tasks such as humanoid balancing and locomotion. This paper presents an iterative approach for learning hardware models and optimizing policies with as few hardware experiments as possible. Instead of learning the model from scratch, our method learns the difference between a simulation model and the hardware; we then optimize the policy on the learned model in simulation. The iterative approach allows us to collect a wider range of data for model refinement while improving the policy.
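
As a rough illustration of this loop, the sketch below shows one way the idea could be organized: fit a Gaussian process to the residual between observed hardware transitions and the simulator's predictions, then optimize the policy against the residual-corrected dynamics, and repeat. This is not the paper's actual implementation; the interfaces simulate_step, run_on_hardware, and optimize_policy are hypothetical placeholders, and the use of scikit-learn's GaussianProcessRegressor is an assumption.

```python
# Minimal sketch of the iterative "learn the sim-to-hardware residual, re-optimize" loop.
# simulate_step, run_on_hardware, and optimize_policy are hypothetical placeholders;
# the GP regression via scikit-learn is an assumption, not the paper's prescribed tool.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel


def learn_residual_model(states, actions, next_states, simulate_step):
    """Fit a GP that predicts the hardware-minus-simulation state difference."""
    X = np.hstack([states, actions])  # regression inputs: (state, action)
    sim_next = np.array([simulate_step(s, a) for s, a in zip(states, actions)])
    residuals = next_states - sim_next  # what the simulation model gets wrong
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel())
    gp.fit(X, residuals)
    return gp


def corrected_step(state, action, simulate_step, gp):
    """Simulator prediction plus the learned residual correction."""
    x = np.hstack([state, action]).reshape(1, -1)
    return simulate_step(state, action) + gp.predict(x)[0]


def iterate(policy, simulate_step, run_on_hardware, optimize_policy, n_iters=5):
    data = []  # accumulated hardware transitions (s, a, s')
    for _ in range(n_iters):
        # 1. Run the current policy on hardware (few, short trials) and log transitions.
        data.extend(run_on_hardware(policy))
        s, a, s_next = (np.array(x) for x in zip(*data))
        # 2. Learn how the hardware deviates from the simulation model.
        gp = learn_residual_model(s, a, s_next, simulate_step)
        # 3. Optimize the policy in simulation using the corrected dynamics.
        policy = optimize_policy(lambda st, ac: corrected_step(st, ac, simulate_step, gp))
    return policy
```

Each iteration adds hardware data gathered under the improved policy, so the residual model is refined in exactly the region of state space the policy visits.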
