Bayesian Nonparametric Reward Learning From Demonstration

Learning from demonstration provides an attractive solution to the problem of teaching autonomous systems to perform complex tasks. Reward learning from demonstration is a promising way to infer a rich and transferable representation of the demonstrator's intent, but current algorithms become intractable and inefficient in large domains because they assume the demonstrator is maximizing a single reward function throughout the entire task. This paper takes a different perspective, assuming instead that an unsegmented demonstration is composed of several distinct subtasks chained together, each governed by its own reward function. Leveraging this assumption, a Bayesian nonparametric reward-learning framework is presented that infers multiple subgoals and their associated reward functions from a single unsegmented demonstration. The framework is developed first for discrete state spaces and then extended to general continuous demonstration domains using Gaussian process reward representations. The algorithm is shown to have both performance and computational advantages over existing inverse reinforcement learning methods. Experimental results are given in both settings, demonstrating the ability to learn challenging maneuvers from demonstration on a quadrotor and a remote-controlled car.
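
To make the partitioning idea concrete, below is a minimal sketch (not the paper's implementation) of Bayesian nonparametric subgoal assignment: a Chinese restaurant process (CRP) prior over partitions of the demonstration, sampled with a Gibbs sweep. For simplicity it scores each demonstrated state with an isotropic Gaussian likelihood around a candidate subgoal location; in the actual framework, the likelihood of a state-action pair would instead measure how consistent the action is with a policy that is near-optimal for the subgoal's reward function. The function name, the Gaussian likelihood, and all parameter values here are illustrative assumptions.

```python
import numpy as np

def crp_gibbs_sweep(states, z, eta=1.0, sigma=0.5, rng=None):
    """One Gibbs sweep over subgoal assignments z (int ndarray).

    Each state is reassigned either to an existing subgoal, with
    probability proportional to cluster size times likelihood, or to a
    brand-new subgoal, with probability proportional to the CRP
    concentration eta.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(states)
    for i in range(n):
        z[i] = -1  # temporarily remove state i from its subgoal
        labels = sorted(k for k in set(z.tolist()) if k >= 0)
        log_p = []
        for k in labels:
            members = states[z == k]
            center = members.mean(axis=0)  # subgoal location estimate
            log_prior = np.log(len(members) / (n - 1 + eta))
            log_lik = -np.sum((states[i] - center) ** 2) / (2 * sigma**2)
            log_p.append(log_prior + log_lik)
        # New subgoal placed at the state itself (a simple proposal),
        # so its Gaussian likelihood term is zero in log space.
        log_p.append(np.log(eta / (n - 1 + eta)))
        log_p = np.asarray(log_p)
        p = np.exp(log_p - log_p.max())
        p /= p.sum()
        c = rng.choice(len(p), p=p)
        z[i] = labels[c] if c < len(labels) else max(labels, default=-1) + 1
    return z

# Toy usage: a trajectory visiting two regions should settle near 2 subgoals.
rng = np.random.default_rng(0)
traj = np.vstack([rng.normal([0.0, 0.0], 0.2, size=(30, 2)),
                  rng.normal([3.0, 3.0], 0.2, size=(30, 2))])
z = np.zeros(len(traj), dtype=int)
for _ in range(20):
    z = crp_gibbs_sweep(traj, z, rng=rng)
print("inferred subgoals:", sorted(set(z.tolist())))
```

Because the CRP prior places no cap on the number of clusters, the number of inferred subgoals grows with the data rather than being fixed in advance, which is what lets a single unsegmented demonstration be split into however many subtasks it actually contains.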
