Learning Search Strategies from Human Demonstrations

Decision making and planning with partial state information is a problem faced by all forms of intelligent entities. Formulating a problem under partial state information leads to a vast set of choices with associated probabilistic outcomes, which makes its resolution difficult with traditional planning methods. Human beings have acquired the ability to act under uncertainty through education and self-learning. Transferring our know-how to artificial agents and robots will allow them to learn faster, and even to surpass us, in tasks where only incomplete knowledge is available; this transfer is the objective of this thesis.

We model how humans reason with respect to their beliefs and transfer this knowledge, in the form of a parameterised policy within a Programming by Demonstration framework, to a robot apprentice for two spatial navigation tasks: in the first, a wooden block must be localised on a table; in the second, a power socket must be found and connected to. In both tasks the human teacher and the robot apprentice rely only on haptic and tactile information. We model the human's and the robot's beliefs by a probability density function, which we update through recursive Bayesian state-space estimation. To model the reasoning processes of the human subjects performing the search tasks, we learn a generative joint distribution over the beliefs and actions (end-effector velocities) recorded during task execution. For the first search task the direct mapping from beliefs to actions is learned, whilst for the second task we incorporate a cost function used to adapt the policy parameters in a Reinforcement Learning framework, which yields a considerable improvement, in terms of the distance travelled to accomplish the task, over learning the behaviour alone.

Both search tasks above can be considered active localisation, since the uncertainty originates only from the position of the agent in the world. We then consider searches in which both the position of the robot and features of the environment are uncertain. Given the unstructured nature of the belief, a histogram parametrisation of the joint distribution over the robot's position and the environment features is necessary. Doing so naively, however, quickly becomes intractable, since the space and time complexity grows exponentially. We demonstrate that, by parametrising only the marginals and memorising the parameters of the measurement likelihood functions, we can recover exactly the same solution as the naive parametrisation at a cost that is linear in space and time.
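The belief over the end-effector's position mentioned above can be maintained with a standard recursive Bayesian filter. The following minimal sketch, in Python, shows such a filter on a discretised (histogram) state space: a prediction step accounts for the commanded motion and its noise, and a correction step folds in a haptic measurement through its likelihood. The grid size, noise level and contact-likelihood model are illustrative assumptions, not values from the thesis.

```python
import numpy as np

# Minimal sketch of a 1-D histogram Bayes filter over the end-effector's
# position along the search axis. All numerical values are illustrative.

N_BINS = 200                               # discretisation of the search axis
belief = np.full(N_BINS, 1.0 / N_BINS)     # uniform prior: position unknown

def predict(belief, shift_bins, motion_std_bins=2.0):
    """Prediction step: shift the belief by the commanded motion and blur it
    with a Gaussian kernel to account for motion noise."""
    shifted = np.roll(belief, shift_bins)
    half = int(4 * motion_std_bins)
    xs = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (xs / motion_std_bins) ** 2)
    kernel /= kernel.sum()
    predicted = np.convolve(shifted, kernel, mode="same")
    return predicted / predicted.sum()

def correct(belief, likelihood):
    """Correction step: multiply by the measurement likelihood p(z | x) and
    renormalise (Bayes' rule on the discretised state space)."""
    posterior = belief * likelihood
    return posterior / posterior.sum()

# Example: a contact measurement whose likelihood peaks near bin 120.
xs = np.arange(N_BINS)
contact_likelihood = np.exp(-0.5 * ((xs - 120) / 5.0) ** 2) + 1e-6

belief = predict(belief, shift_bins=3)
belief = correct(belief, contact_likelihood)
print("most likely bin:", int(np.argmax(belief)))
```

The same predict/correct recursion applies regardless of how the belief is parametrised; what changes between tasks is the state space and the measurement likelihood.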
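The learned search policy maps the current belief to an end-effector velocity. One common way to realise a generative joint distribution over beliefs and actions is a Gaussian mixture model fitted to stacked belief-feature/action vectors, with the action obtained at run time by Gaussian mixture regression; whether the thesis uses exactly this parametrisation is an assumption of this sketch, and the dimensions and training data below are synthetic placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hedged sketch: joint GMM over (belief features, action), policy obtained by
# Gaussian mixture regression, i.e. conditioning the mixture on the belief.
# Dimensions and demonstration data are synthetic placeholders.

rng = np.random.default_rng(0)
D_B, D_A = 4, 2                        # belief-feature and action dimensions
X = rng.normal(size=(500, D_B + D_A))  # placeholder demonstration data

gmm = GaussianMixture(n_components=5, covariance_type="full", random_state=0)
gmm.fit(X)

def policy(b):
    """Gaussian mixture regression: E[action | belief features b]."""
    b = np.asarray(b, dtype=float)
    cond_means, log_resps = [], []
    for k in range(gmm.n_components):
        mu_b, mu_a = gmm.means_[k, :D_B], gmm.means_[k, D_B:]
        S_bb = gmm.covariances_[k][:D_B, :D_B]   # belief block
        S_ab = gmm.covariances_[k][D_B:, :D_B]   # action-belief block
        diff = b - mu_b
        # conditional mean of the action given the belief features
        cond_means.append(mu_a + S_ab @ np.linalg.solve(S_bb, diff))
        # (unnormalised) log responsibility of component k for b
        _, logdet = np.linalg.slogdet(S_bb)
        quad = diff @ np.linalg.solve(S_bb, diff)
        log_resps.append(np.log(gmm.weights_[k]) - 0.5 * (logdet + quad))
    log_resps = np.array(log_resps)
    resps = np.exp(log_resps - log_resps.max())
    resps /= resps.sum()
    return np.sum(resps[:, None] * np.array(cond_means), axis=0)

print("commanded end-effector velocity:", policy(np.zeros(D_B)))
```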

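The complexity argument at the end of the abstract can be made concrete with a simple count. Assuming, purely for illustration, a position grid of 1000 cells and K binary environment features, a joint histogram needs 1000 * 2^K cells while the marginals need only 1000 + 2K; the exact-equivalence claim additionally relies on memorising the measurement-likelihood parameters, which this count does not capture.

```python
# Illustrative storage count (all sizes are assumptions for this sketch):
# a joint histogram over the robot's position and K binary environment
# features versus one histogram per marginal.
position_bins = 1000                        # cells in the position grid

for K in (5, 10, 20):                       # number of binary features
    joint_cells = position_bins * 2 ** K    # joint belief: exponential in K
    marginal_cells = position_bins + 2 * K  # marginals only: linear in K
    print(f"K={K:2d}: joint={joint_cells:>13,} cells, marginals={marginal_cells:,} cells")
```

For K = 20 the joint already exceeds a billion cells, while the marginals fit in a little over a thousand.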