Planning Approximate Exploration Trajectories for Model-Free Reinforcement Learning in Contact-Rich Manipulation

Recent progress in deep reinforcement learning has enabled simulated agents to learn complex behavior policies from scratch, but their sample complexity often prohibits real-world application. The learning process can be sped up with expert demonstrations, but these can be costly to acquire. We show that model-free deep reinforcement learning combined with planning can quickly generate informative data for a manipulation task. In particular, we use an approximate trajectory optimization approach for global exploration based on an upper confidence bound of the advantage function. The advantage is approximated by a Q-learning network with separately updated streams for the state value and the advantage, which allows an ensemble to approximate model uncertainty for the advantage stream only. We evaluate our method on new extensions of the classical peg-in-hole task, one of which is only solvable by actively exploiting contacts between the peg tips and the holes. The experimental evaluation suggests that our method explores more relevant areas of the environment and finds exemplar solutions faster, both on a real robot and in simulation. Combining our exploration with learning from demonstration outperforms state-of-the-art model-free reinforcement learning in convergence speed on contact-rich manipulation tasks.
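
To make the exploration signal concrete, below is a minimal sketch (not the authors' implementation) of a dueling-style Q-network whose advantage stream is an ensemble, so that an optimistic, UCB-style advantage score can be formed as the ensemble mean plus a weighted standard deviation. The layer sizes, ensemble size, weight beta, and discrete action set are illustrative assumptions; a trajectory optimizer as described in the abstract would query such a score when ranking candidate exploration actions.

```python
# Sketch: dueling Q-network with a single state-value stream V(s) and an
# ensemble of advantage streams A_k(s, a); the ensemble spread provides an
# upper confidence bound on the advantage for exploration. All hyperparameters
# here are illustrative assumptions, not the paper's settings.
import torch
import torch.nn as nn


class DuelingEnsembleQ(nn.Module):
    def __init__(self, state_dim, num_actions, ensemble_size=5, hidden=64):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        # One shared state-value stream V(s).
        self.value = nn.Linear(hidden, 1)
        # Ensemble of advantage streams A_k(s, a), k = 1..K.
        self.advantages = nn.ModuleList(
            nn.Linear(hidden, num_actions) for _ in range(ensemble_size)
        )

    def forward(self, state):
        h = self.features(state)
        v = self.value(h)                                              # (B, 1)
        a = torch.stack([head(h) for head in self.advantages], dim=0)  # (K, B, A)
        a = a - a.mean(dim=-1, keepdim=True)   # dueling identifiability constraint
        q = v.unsqueeze(0) + a                                         # (K, B, A)
        return q, a

    def ucb_advantage(self, state, beta=1.0):
        # Optimistic advantage: ensemble mean plus beta times ensemble std,
        # used to score actions (or trajectory waypoints) during exploration.
        _, a = self.forward(state)
        return a.mean(dim=0) + beta * a.std(dim=0)                     # (B, A)


if __name__ == "__main__":
    net = DuelingEnsembleQ(state_dim=6, num_actions=9)
    s = torch.randn(1, 6)
    scores = net.ucb_advantage(s, beta=2.0)
    print(scores, scores.argmax(dim=-1))  # optimistic action for this state
```

Keeping the ensemble on the advantage stream only, as the abstract describes, means the exploration bonus reflects uncertainty about relative action quality rather than about the absolute state value, which is what a trajectory planner needs to rank candidate actions.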
