AutoEG: Automated Experience Grafting for Off-Policy Deep Reinforcement Learning

Deep reinforcement learning (RL) algorithms frequently require a prohibitive amount of interaction experience to ensure the quality of learned policies. This limitation arises partly because the agent learns little from the many low-quality trials in the early learning phase, which slows learning. To address this limitation, this paper makes a twofold contribution. First, we develop an algorithm, called Experience Grafting (EG), that enables RL agents to reorganize segments of the few high-quality trajectories in the experience pool into many synthetic trajectories while retaining their quality. Second, building on EG, we develop an AutoEG agent that automatically learns to adjust the grafting-based learning strategy. Results from a set of six robotic control environments show that, compared to a standard deep RL algorithm (DDPG), AutoEG speeds up the learning process by at least 30%.
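To make the grafting idea concrete, below is a minimal Python sketch of segment-level trajectory grafting. It is an illustrative assumption of one way EG-style grafting could work, not the paper's exact procedure: trajectories are assumed to be lists of (state, action, reward, next_state) tuples, segments are spliced where two states fall within a distance threshold `eps` so the synthetic trajectory stays roughly dynamically consistent, and all function names (`top_trajectories`, `graft`, `generate_synthetic`) and parameters are hypothetical.

```python
import numpy as np

def top_trajectories(pool, k):
    """Pick the k highest-return trajectories from the experience pool.

    Each trajectory is a list of (state, action, reward, next_state)
    tuples; return is the undiscounted reward sum (an assumption).
    """
    returns = [sum(step[2] for step in traj) for traj in pool]
    order = np.argsort(returns)[::-1][:k]
    return [pool[i] for i in order]

def graft(traj_a, traj_b, eps=0.1):
    """Splice a suffix of traj_b onto a prefix of traj_a.

    The splice point is a pair of transitions whose states lie within
    eps of each other (an illustrative proximity criterion), so the
    grafted trajectory remains roughly consistent with the dynamics.
    Returns None if no suitable splice point exists.
    """
    for i, (_, _, _, s_next) in enumerate(traj_a):
        for j, (s, _, _, _) in enumerate(traj_b):
            if np.linalg.norm(np.asarray(s_next) - np.asarray(s)) < eps:
                return traj_a[: i + 1] + traj_b[j:]
    return None

def generate_synthetic(pool, k=10, n=100, eps=0.1, rng=None):
    """Generate up to n synthetic trajectories by grafting segments of
    the k best trajectories onto one another (assumes k >= 2)."""
    rng = rng or np.random.default_rng()
    elite = top_trajectories(pool, k)
    synthetic = []
    for _ in range(10 * n):  # bounded attempts, since grafts can fail
        if len(synthetic) >= n:
            break
        a, b = rng.choice(len(elite), size=2, replace=False)
        grafted = graft(elite[a], elite[b], eps)
        if grafted is not None:
            synthetic.append(grafted)
    return synthetic
```

In the full method, such synthetic trajectories would plausibly be pushed back into the replay buffer of an off-policy learner such as DDPG, with AutoEG's learned strategy adjusting knobs like k, n, and eps over the course of training.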
