Learning rewards from exploratory demonstrations using probabilistic temporal ranking

This paper addresses a common class of problems in which a robot learns to perform a discovery task from example solutions, or \emph{human demonstrations}. As an example, this work considers ultrasound scanning, where a demonstration involves an expert adaptively searching for a satisfactory view of internal organs, vessels or tissue and potential anomalies, while maintaining optimal contact between the probe and the surface tissue. Such problems are often solved by inferring notional \emph{rewards} that, when optimised for, result in plans that mimic the demonstrations. A pivotal assumption, that plans with higher reward should be exponentially more likely, underpins maximum entropy inverse reinforcement learning, the de facto approach to reward inference in robotics. While this leads to a general and elegant formulation, it struggles to cope with the sub-optimal demonstrations frequently encountered in practice, and in this paper we propose an alternative approach for this class of problems. We hypothesise that, in tasks requiring discovery, successive states of any demonstration are progressively more likely to be associated with higher reward. We formalise this \emph{temporal ranking} approach and show that it improves upon maximum-entropy approaches when inferring rewards for autonomous ultrasound scanning, a novel application of learning from demonstration in medical imaging.
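To make the contrast above concrete, the following is a minimal sketch of the two modelling assumptions, written in illustrative notation: the trajectory $\tau$ with states $s_t$, the state reward $r$, the probit link $\Phi$ and the noise scale $\sigma$ are our own symbols and need not match the paper's exact formulation.

% Maximum entropy IRL: trajectories accruing higher cumulative reward are
% assumed exponentially more likely to have been demonstrated
% (illustrative notation; the paper's exact likelihood may differ).
\[
p(\tau) \;\propto\; \exp\!\Big(\sum_{t} r(s_t)\Big).
\]

% Temporal ranking hypothesis: for two states drawn from the same
% demonstration at times $t_1 < t_2$, the later state is more likely to
% carry the higher reward, e.g. through a probit-style pairwise comparison
% (again an illustrative choice of link function and noise model).
\[
p\big(s_{t_2} \succ s_{t_1}\big) \;=\; \Phi\!\left(\frac{r(s_{t_2}) - r(s_{t_1})}{\sqrt{2}\,\sigma}\right), \qquad t_2 > t_1.
\]

Intuitively, a reward model fit under the second likelihood only has to explain the ordering of states within each demonstration, rather than treat every demonstrated trajectory as near-optimal, which suggests why occasional sub-optimal excursions during exploratory search are less damaging than under the exponential trajectory likelihood.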
