Guided Policy Search with Delayed Sensor Measurements

Guided policy search is a reinforcement learning method that trains a general policy for a given task by guiding its learning with multiple guiding distributions. It relies on learning an underlying dynamics model of the environment and then, at each iteration of the algorithm, using that model to gradually improve the policy. This model, however, typically assumes that the environment dynamics are Markovian, i.e., that they depend only on the current state and control signal. In this paper we apply guided policy search to a problem with non-Markovian dynamics. Specifically, we apply it to the problem of pouring a precise amount of liquid from a cup into a bowl, where many of the sensor measurements experience non-trivial delays. We show that, with relatively simple state augmentation, guided policy search can be extended to non-Markovian dynamical systems in which the non-Markovianness is caused by delayed sensor readings.
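To make the state-augmentation idea concrete, the sketch below shows one standard way to restore the Markov property when a sensor lags by a known number of steps: append the last d control signals to the delayed measurement, so the augmented state carries enough history for a Markovian dynamics model to be fit over it. This is our illustration, not the paper's implementation; the class name, the delay_steps parameter, and the buffer layout are assumptions made for the example.

    import numpy as np

    class DelayAugmentedState:
        """Wraps a delayed sensor measurement with a buffer of recent controls.

        If the measurement lags the true state by `delay_steps` steps, the
        delayed measurement plus the buffered controls jointly determine the
        next augmented state, restoring the Markov property. `delay_steps`
        is a hypothetical parameter assumed known here.
        """

        def __init__(self, state_dim, control_dim, delay_steps):
            self.state_dim = state_dim
            self.control_buffer = np.zeros((delay_steps, control_dim))

        def augment(self, delayed_state):
            # Augmented state = [delayed measurement, last d controls].
            assert delayed_state.shape == (self.state_dim,)
            return np.concatenate([delayed_state, self.control_buffer.ravel()])

        def record_control(self, u):
            # Shift the buffer back one step and store the newest control.
            self.control_buffer = np.roll(self.control_buffer, -1, axis=0)
            self.control_buffer[-1] = u

    # Usage sketch: the policy and the fitted dynamics model both operate on
    # the augmented state rather than the raw delayed measurement.
    aug = DelayAugmentedState(state_dim=3, control_dim=2, delay_steps=4)
    x_aug = aug.augment(np.zeros(3))   # input to policy / dynamics fitting
    u = np.array([0.1, -0.2])          # control chosen by the policy
    aug.record_control(u)

The augmented state grows only by delay_steps * control_dim dimensions, which is why the paper can describe the extension as relatively simple: the rest of the guided policy search pipeline is unchanged.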
