DCUR: Data Curriculum for Teaching via Samples with Reinforcement Learning

Deep reinforcement learning (RL) has shown great empirical success, but suffers from brittleness and sample inefficiency. A potential remedy is to use a previously-trained policy as a source of supervision. In this work, we refer to these policies as teachers and study how to transfer their expertise to new student policies by focusing on data usage. We propose a framework, Data CUrriculum for Reinforcement learning (DCUR), which first trains teachers using online deep RL and stores the logged environment interaction history. Students then learn by running either offline RL on this data or by combining the teacher data with a small amount of self-generated data. DCUR's central idea is to define a class of data curricula which, as a function of training time, limit the student to sampling from a fixed subset of the full teacher data. We test teachers and students using state-of-the-art deep RL algorithms across a variety of data curricula. Results suggest that the choice of data curriculum significantly impacts student learning, and that it is beneficial to limit the available data during early training stages while gradually growing it over time. We identify when the student can learn offline and match teacher performance without relying on specialized offline RL algorithms. Furthermore, we show that collecting a small fraction of online data complements the data curriculum. Supplementary material is available at https://tinyurl.com/teach-dcur.
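To make the central idea concrete, the following is a minimal sketch of a DCUR-style data curriculum: the teacher's logged transitions are kept in arrival order, and at student training step t sampling is restricted to a prefix of the data whose size grows over time. The class name, the linear growth schedule, and the initial fraction below are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

class DataCurriculumBuffer:
    """Sketch of a data curriculum over logged teacher data.

    At student step t, minibatches are drawn only from the first
    allowed_size(t) transitions; the limit grows toward the full dataset.
    """

    def __init__(self, teacher_data, total_student_steps, init_fraction=0.1):
        self.data = teacher_data                  # teacher transitions, in logged order
        self.total_steps = total_student_steps
        self.init_fraction = init_fraction        # assumed starting fraction of data

    def allowed_size(self, t):
        # Linearly grow the available subset from init_fraction to the full data.
        frac = min(1.0, self.init_fraction +
                   (1.0 - self.init_fraction) * t / self.total_steps)
        return max(1, int(frac * len(self.data)))

    def sample(self, t, batch_size=256):
        # Uniformly sample a minibatch from the currently allowed prefix.
        limit = self.allowed_size(t)
        idx = np.random.randint(0, limit, size=batch_size)
        return [self.data[i] for i in idx]
```

In the student-with-online-data setting described above, one would additionally append a small number of self-generated transitions to the buffer and sample them alongside the allowed teacher prefix; that extension is omitted here for brevity.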
