SAIL: Simulation-Informed Active In-the-Wild Learning

Robots in real-world environments may need to adapt context-specific behaviors learned in one environment to new environments with new constraints. In many cases, copresent humans can provide the robot with information, but it may not be safe for them to give hands-on demonstrations, and there may be no dedicated supervisor to provide constant feedback. In this work we present SAIL (Simulation-Informed Active In-the-Wild Learning), an algorithm for learning new approaches to manipulation skills starting from a single demonstration. In this three-step algorithm, the robot first simulates task execution to choose promising new approaches; it then collects unsupervised data on task execution in the target environment; finally, it chooses informative actions to show to copresent humans in order to obtain labels. Our approach enables a robot to learn new ways of executing two different tasks using success/failure labels obtained from naïve users in a public space, performing 496 manipulation actions and collecting 163 labels from users in the wild over six deployments of 45 minutes to 1 hour each. We show that classifiers based on low-level sensor data can accurately distinguish between successful and unsuccessful motions in a multi-step task ($p < 0.005$), even when trained in the wild. We also show that using the sensor data to choose which actions to sample is more effective than choosing the least-sampled action.
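
To make the three-step loop concrete, the following is a minimal, self-contained Python sketch of the structure the abstract describes. Every name in it (the `simulate` feasibility check, the `SuccessModel` over sensor features, the uncertainty-driven sampling rule) is a hypothetical stand-in chosen for illustration, not the authors' implementation; in particular, the variance-based uncertainty score is only one plausible way to realize "using the sensor data to choose which actions to sample."

```python
import random

def simulate(approach):
    """Step 1 stand-in: a feasibility check a real simulator would perform."""
    return random.random() > 0.2         # toy outcome, not a physics simulation

class SuccessModel:
    """Toy success/failure model standing in for a classifier on sensor data."""
    def __init__(self):
        self.stats = {}                  # approach -> [num_successes, num_trials]

    def update(self, approach, label):
        s, n = self.stats.get(approach, [0, 0])
        self.stats[approach] = [s + int(label), n + 1]

    def uncertainty(self, approach):
        s, n = self.stats.get(approach, [0, 0])
        if n == 0:
            return float("inf")          # never tried: maximally informative
        p = s / n
        return p * (1 - p)               # Bernoulli variance as a crude proxy

def sail(approaches, n_rounds=20):
    # Step 1: simulation filters the candidate approaches.
    feasible = [a for a in approaches if simulate(a)]
    model = SuccessModel()
    if not feasible:
        return model
    for _ in range(n_rounds):
        # Step 2: execute the approach the model is least certain about
        # (rather than the least-sampled one).
        a = max(feasible, key=model.uncertainty)
        # Step 3: when a passer-by is available, ask for a success/failure
        # label on this execution; a coin flip stands in for both here.
        label = random.random() > 0.5
        model.update(a, label)
    return model

if __name__ == "__main__":
    print(sail(["pull-handle", "push-edge", "twist-knob"]).stats)
```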
