Improving Behavioural Cloning with Positive Unlabeled Learning

Learning control policies offline from pre-recorded datasets is a promising avenue for solving challenging real-world problems. However, available datasets are typically of mixed quality, with only a limited number of trajectories that we would consider positive examples, i.e., high-quality demonstrations. We therefore propose a novel iterative learning algorithm for identifying expert trajectories in unlabeled, mixed-quality robotics datasets given a minimal set of positive examples, surpassing existing algorithms in terms of accuracy. We show that applying behavioral cloning to the resulting filtered dataset outperforms several competitive offline reinforcement learning and imitation learning baselines. We perform experiments on a range of simulated locomotion tasks and on two challenging manipulation tasks on a real robotic system; in these experiments, our method achieves state-of-the-art performance. Our website: \url{https://sites.google.com/view/offline-policy-learning-pubc}.
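
To make the described pipeline concrete, the following is a minimal sketch, not the authors' exact method, of iterative positive-unlabeled (PU) trajectory filtering followed by behavioral cloning. The trajectory featurisation (mean and standard deviation of state-action pairs), the logistic-regression classifier, the ridge-regression policy, and all function names (trajectory_features, pu_filter, behavioral_cloning) are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge


def trajectory_features(traj):
    # Summarise a trajectory as a fixed-size vector: mean and std of
    # concatenated state-action pairs (assumed featurisation).
    sa = np.concatenate([traj["states"], traj["actions"]], axis=1)
    return np.concatenate([sa.mean(axis=0), sa.std(axis=0)])


def pu_filter(positive_trajs, unlabeled_trajs, n_iters=5, add_frac=0.1):
    # Iteratively grow the positive set: train a positive-vs-unlabeled
    # classifier, then promote the most confidently "expert-like"
    # unlabeled trajectories into the positive set.
    pos, unl = list(positive_trajs), list(unlabeled_trajs)
    for _ in range(n_iters):
        if not unl:
            break
        X = np.stack([trajectory_features(t) for t in pos + unl])
        y = np.array([1] * len(pos) + [0] * len(unl))
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        scores = clf.predict_proba(
            np.stack([trajectory_features(t) for t in unl]))[:, 1]
        k = max(1, int(add_frac * len(unl)))
        top = set(np.argsort(scores)[::-1][:k])
        pos += [unl[i] for i in top]
        unl = [t for i, t in enumerate(unl) if i not in top]
    return pos


def behavioral_cloning(trajs):
    # Fit a simple policy (a ridge regressor stands in for a neural
    # network) mapping states to actions on the filtered trajectories.
    states = np.concatenate([t["states"] for t in trajs])
    actions = np.concatenate([t["actions"] for t in trajs])
    return Ridge(alpha=1.0).fit(states, actions)


if __name__ == "__main__":
    # Synthetic smoke test: a few labeled expert trajectories plus a
    # large unlabeled pool containing a mix of expert and random data.
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 2))  # shared "expert" policy for toy data

    def fake_traj(expert):
        s = rng.normal(size=(50, 4))
        a = s @ W if expert else rng.normal(size=(50, 2))
        return {"states": s, "actions": a}

    positives = [fake_traj(True) for _ in range(3)]
    unlabeled = [fake_traj(rng.random() < 0.3) for _ in range(100)]
    filtered = pu_filter(positives, unlabeled)
    policy = behavioral_cloning(filtered)
    print("filtered trajectories:", len(filtered))

In this sketch the classifier treats all unlabeled trajectories as provisional negatives, which is a common PU-learning heuristic; the selection fraction and iteration count would in practice be tuned per dataset.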
