Deep dynamic policy programming for robot control with raw images

Deep reinforcement learning has attracted much attention in robot control because it enables agents to learn control policies directly from high-dimensional states such as raw images. However, its dependence on large amounts of training data and its fragility during learning make it difficult to apply to real-world robot tasks. To alleviate these issues, we propose Deep Dynamic Policy Programming (DDPP), which combines the sample efficiency and smooth policy updates of dynamic policy programming with the modern deep reinforcement learning framework. The effectiveness of the proposed method is first demonstrated on a simulated robot-arm control problem, in comparison with Deep Q-Networks (DQN). As validation on a real robot system, DDPP also successfully learned to flip a handkerchief with a NEXTAGE humanoid robot from a reduced number of learning samples, a task that DQN failed to learn.
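For orientation, the sketch below illustrates the kind of update dynamic policy programming performs and that DDPP lifts to a deep network. It is a minimal tabular sketch, not the paper's implementation: the log-sum-exp ("soft-max") form of the operator, the learning rate, and all hyperparameter values are illustrative assumptions, and DDPP itself would replace the table `psi` with a convolutional network over raw images, trained with the usual deep-RL machinery (minibatches, a target network).

```python
import numpy as np

def soft_max(psi_row, eta):
    """Log-partition operator L_eta psi(s) = (1/eta) * log sum_a exp(eta * psi(s,a))."""
    z = eta * psi_row
    m = z.max()                       # shift for numerical stability
    return (m + np.log(np.exp(z - m).sum())) / eta

def boltzmann_policy(psi_row, eta):
    """Softmax policy pi(a|s) proportional to exp(eta * psi(s,a))."""
    z = eta * psi_row
    p = np.exp(z - z.max())
    return p / p.sum()

def dpp_update(psi, s, a, r, s_next, eta=2.0, gamma=0.95, alpha=0.5):
    """Sampled DPP-style update of the action preferences (alpha is an
    illustrative learning rate; the exact form in the paper may differ):
        psi(s,a) += alpha * (r + gamma * L_eta psi(s') - L_eta psi(s))"""
    psi[s, a] += alpha * (r + gamma * soft_max(psi[s_next], eta)
                          - soft_max(psi[s], eta))

# Toy usage: 5 states, 2 actions, one fictitious transition.
psi = np.zeros((5, 2))
dpp_update(psi, s=0, a=1, r=1.0, s_next=3)
print(boltzmann_policy(psi[0], eta=2.0))
```

Because the policy is the softmax of accumulated action preferences rather than a greedy argmax over Q-values, successive policies change gradually from one iteration to the next; this is the "smooth policy update" property the abstract attributes to dynamic policy programming, and a plausible reason for the improved sample efficiency over DQN on the real robot.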
